So I have been dealing with this for quite a while now, experimenting with time series datasets and neural networks, and I would like to share some thoughts here that others might hopefully find helpful for avoiding some unnecessary mistakes.
We know that neural networks can suffer from a phenomenon called covariate shift. For those who don’t know it, very briefly: it is the change in the distribution of a layer’s output activations when the parameters of the network change. What we would like to see are stable distributions at the output of each layer of our network. Otherwise, the parameters of a given layer become too strong a function of changes in the parameters of previous layers, which ain’t good, to say it plainly, as it frequently leads to saturated activation units and consequently vanishing gradients, which basically make the model get stuck and inhibit it from learning. Your network will then behave like a dog chasing its own tail ;)
Practically, that’s what batch normalization cures: by normalizing and rescaling the distribution of activation outputs after each layer, it enables more stable, faster training with higher learning rates and, sloppily said, decreases the dependence of each layer on the parameters of the previous layers.
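For reference, here is a minimal sketch of what batch norm does to the activations of a single unit. This is my own plain-NumPy illustration, not the implementation of any particular framework:

```python
import numpy as np

def batch_norm(a: np.ndarray, gamma: float = 1.0, beta: float = 0.0,
               eps: float = 1e-5) -> np.ndarray:
    """Normalize the activations `a` of one unit across the batch (shape: (batch_size,)),
    then rescale with the learnable parameters gamma and beta."""
    mu = a.mean(axis=0)    # batch mean
    var = a.var(axis=0)    # batch variance
    return gamma * (a - mu) / np.sqrt(var + eps) + beta
```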
When applying batch normalization to time series, things become a bit, or should I say way, more difficult than with the usual image data (or similar). In this article, I am mainly dealing with the immense problem called forward-looking bias. In short, it is the unwanted exploitation of future knowledge about a prediction by an algorithm, which leads to overly good prediction results that cannot be replicated out-of-sample. In time series, especially in financial time series and strategy development, I would consider forward-looking bias maybe THE number one pitfall of all. I surmise that forward-looking bias contributes substantially to the famous statement:
“If something looks too good to be true, it is too good to be true.”
Alright, let’s get to the time series case and think of the following setting. We observe multiple time series
x_i[t], i = 1, ..., N

in parallel over time, where N is the total number of time series observed in parallel and x[t] is the vector of all time series’ values at time t. As the training input matrix for, let’s say, an MLP neural network, we can use matrices of the form

X[t] = x[t-L:t]

such that the matrix X has dimensions (N, L) and only uses data up to t, with a lookback window L.
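In code, building such an input matrix could look like this. It is a NumPy sketch with made-up toy sizes, and the slicing convention (excluding time t itself) is just one possible reading of the notation above:

```python
import numpy as np

N, T, L = 4, 500, 60          # toy sizes: N series, T time steps, lookback L
x = np.random.randn(N, T)     # x[:, t] is the vector of all series' values at time t

def make_input(x: np.ndarray, t: int, L: int) -> np.ndarray:
    """Return X[t] = x[t-L:t], an (N, L) matrix that only uses data up to time t."""
    assert t >= L, "need at least L past observations"
    return x[:, t - L:t]

X_t = make_input(x, t=200, L=L)
print(X_t.shape)              # (4, 60) -> (N, L)
```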
There are basically two different approaches for generating batches based on the matrix X:

- Construct batches of X with overlapping lookback windows
- Construct batches of X with non-overlapping / disjoint windows

Say the batch contains M examples. For case 1), a batch B is then the tensor of matrices X:

B[t] = Tensor(X[t-i] | i = M-1, M-2, ..., 0)

with dimensions (M, N, L). The batch is time-ordered such that its last batch sample is the most recent one, corresponding to time t. Similarly, one picks disjoint windows for case 2).
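A sketch of both batch constructions (again NumPy, with hypothetical helper names and toy sizes) might look as follows; for the disjoint case I simply space the windows L steps apart, which is one possible choice:

```python
import numpy as np

def batch_overlapping(x: np.ndarray, t: int, L: int, M: int) -> np.ndarray:
    """Case 1): M samples X[t-i], i = M-1, ..., 0, with overlapping windows,
    time-ordered so the last sample is the most recent one (ending at t)."""
    return np.stack([x[:, (t - i) - L:(t - i)] for i in range(M - 1, -1, -1)], axis=0)

def batch_disjoint(x: np.ndarray, t: int, L: int, M: int) -> np.ndarray:
    """Case 2): M samples with non-overlapping windows, spaced L steps apart."""
    return np.stack([x[:, (t - i * L) - L:(t - i * L)] for i in range(M - 1, -1, -1)], axis=0)

N, T, L, M = 4, 1000, 60, 8
x = np.random.randn(N, T)
B = batch_overlapping(x, t=900, L=L, M=M)
print(B.shape)                # (8, 4, 60) -> (M, N, L)
```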
As stated above, to get back on track, the main issue I would like to discuss here is leakage / forward-looking bias when applying batch norm to these batches. This is where time series differ from, for instance, image data. With image data, we enjoy the luxury that the latter dimensions of X (N and L) correspond to the height and width of an image and are all available at the same time when the image is presented to the model. Thus, batch norm can be applied straightforwardly. In time series data, however, the dimension L corresponds to the time dimension. Thus, when applying batch norm to the matrix X, we have two leakage sources:
a) Data points earlier in the lookback window are normalized with respect to the mean and std computed over the entire dimension L.
b) If the batch samples have overlapping windows, samples corresponding to earlier end times are normalized w.r.t. statistics computed over all samples in the batch (including ones ending at later times).
Clearly, these two mechanisms immediately introduce leakage into a batchnorm-processed dataset, which would be easy for a model to exploit when trying to predict future values etc. The model will perform best on the early batch samples (as it sees what happens in the later batch samples) and worst on the late batch samples (which are often the most crucial ones). This dilemma basically forbids us from carrying out batch norm in this way.
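To make the two sources concrete, here is a small PyTorch illustration (my own toy example, not the exact setup above): a standard BatchNorm1d layer treats N as channels and pools its statistics over both the batch dimension M and the time dimension L, which is exactly where a) and b) come from.

```python
import torch
import torch.nn as nn

M, N, L = 8, 4, 60
B = torch.randn(M, N, L)      # a batch of M samples of shape (N, L)

bn = nn.BatchNorm1d(N)        # the N series are treated as "channels"
out = bn(B)                   # training mode: normalizes with batch statistics

# The statistics for series n are computed over all M*L values, i.e. over every
# window in the batch and every time step within each window -- including values
# that lie in the "future" relative to earlier samples (source b) and relative to
# earlier positions within the same window (source a).
print(B.mean(dim=(0, 2)))     # the means BatchNorm1d normalizes with
print(out.mean(dim=(0, 2)))   # ~0 after normalization
```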
With method 2), the non-overlapping batch sample windows, we have other drawbacks. Firstly, if L is large and the dataset is limited, this can result in an insufficient number of batch samples. Often, though, it is crucial to have a long lookback L in order to pick up long-term patterns. Secondly, if we have non-stationary data, i.e. data with changing statistics such as mean and variance, normalizing over disjoint windows may give us inaccurate statistics, because we compute the batch statistics partly based on very "old" / outdated batch samples.
So, neither of the two methods seems to combine particularly well with batch normalization. Yet batch norm, due to the properties described above, can be crucial for training neural networks.
My current way of dealing with this is to use overlapping windows (with maybe a bit more than one time step of spacing between subsequent batch sample windows) and to take only the predictions generated by the model on the LAST batch sample and evaluate these in the loss function. Based on that, I update the loss every few steps.
This method enables me to compute the batch statistics based on recent windows while avoiding forward-looking bias. A huge drawback of my method is that it generates only one prediction per batch to update the loss function and will therefore lead to very slow convergence in training.
Another drawback: computing the gradient based on a single prediction may lead to very noisy gradients, i.e. disoriented jumps of the optimizer in parameter space, which additionally slows down, if not inhibits, convergence when the learning rate is too high. This is why, as described above, I accumulate the losses computed from several single predictions (each prediction resulting from one batch) and then update the optimizer every few batches. This is like making batches of batches. Still, the computational cost is of course high and very inefficient, as the model always needs to evaluate every sample in the batch in order to come up with a single prediction, instead of one prediction per sample.
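The following PyTorch sketch shows roughly what I mean; the toy model, names, and shapes are mine and purely illustrative. Each batch yields one usable prediction, and the losses of K batches are accumulated before a single optimizer step, the "batch of batches":

```python
import torch
import torch.nn as nn

M, N, L, K = 8, 4, 60, 16                 # K = batches accumulated per optimizer step
model = nn.Sequential(nn.Flatten(),       # (M, N, L) -> (M, N*L)
                      nn.BatchNorm1d(N * L),
                      nn.Linear(N * L, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def training_step(batches, targets):
    """batches: K tensors of shape (M, N, L); targets: K scalar tensors, each the
    target value for the LAST (most recent) sample of the corresponding batch."""
    opt.zero_grad()
    total = 0.0
    for B, y in zip(batches, targets):
        preds = model(B)                  # one prediction per batch sample, shape (M, 1)
        pred_last = preds[-1, 0]          # keep only the most recent sample's prediction
        total = total + loss_fn(pred_last, y)
    mean_loss = total / len(batches)
    mean_loss.backward()                  # one update per K batches ("batches of batches")
    opt.step()
    return mean_loss.item()
```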
An alternative normalization scheme that I recently started using normalizes not across batches, but only across the individual time series themselves. In a given sample matrix X, for each time series n of N, I compute the mean mu_n (a scalar) across the lookback dimension L. This still suffers from the leakage mechanism described in a), but it should still be OK to do: we can imagine the algorithm simply receiving a past time series and normalizing it along the time dimension, while making only a single prediction at the very end of the time series, which should be a valid thing to do.
Yet, I am wondering whether there are better approaches, or whether I have missed some standard approaches that are used for time series in this situation? Literature seems to be quite sparse here.
Looking forward to your feedback!
Best, JZ