State Space Models (SSMs) use first-order differential equations to represent dynamic systems.
The HiPPO framework provides a mathematical foundation for maintaining continuous representations of time-dependent data, enabling efficient approximation of long-range dependencies in sequence modeling.
Discretization of continuous-time SSMs lays the groundwork for processing natural language and modeling long-range dependencies in a computationally efficient way.
LSSL, S4, and S5 are increasingly sophisticated and efficient sequence-to-sequence state-space models that pave the way for viable SSM-based alternatives to transformer models.
While transformer-based models are in the limelight of the NLP community, a quiet revolution in sequence modeling is underway. State Space Models (SSMs) have the potential to address one of the key challenges of transformers: scaling efficiently with sequence length.
In a series of articles, we’ll introduce the foundations of SSMs, explore their application to sequence-to-sequence language modeling, and provide hands-on guidance for training the state-of-the-art SSMs Mamba and Jamba.
In this first article of the three-part series, we’ll examine the core principles of SSMs, trace their evolution from Linear State Space Layers (LSSL) to the S5 model, and explore their potential to revolutionize sequence modeling with unparalleled efficiency.
Understanding state space models
Before exploring how State Space Models (SSMs) can function as components of large language models (LLMs), we’ll examine their foundational mechanics. This will allow us to understand how SSMs operate within deep neural networks and why they hold promise for efficient sequence modeling.
SSMs are a method for modeling, studying, and controlling the behavior of dynamic systems, which have a state that varies with time. SSMs represent dynamic systems using first-order differential equations, providing a structured framework for analysis and simplifying computations compared to solving higher-order differential equations directly.
Let’s dissect what this means.
Consider a system consisting of a moving car on the road. When we supply a certain input to this system (like pressing the gas pedal), we alter the car’s current state (for example, the amount of gas the engine is burning) and consequently cause the car to move at a certain speed.
Because our system’s state varies with time, it is considered a dynamic system. In this case, we are studying one state variable (the amount of gas the engine burns) in our state (the car’s internals). State variables are the smallest set of variables that fully describe the system’s behavior in a mathematical representation.

In our scenario, the car was already moving, so it was burning gas—a result of the previous force on the gas pedal. The speed we would get if we pressed the pedal in a stationary car differs from the speed we would get if the car were already moving since the engine would need less additional gas (and less additional input force) to reach a certain speed. Thus, when determining the speed, we should also factor in the car’s previous state.

There is one more thing to consider. State Space Models also model a “skip connection,” which represents the direct influence of the input on the output. In our case, the skip connection would model an immediate influence of pressing the gas pedal on the car’s speed, regardless of the current state. In the specific case of a car, this direct feedthrough (D) is zero, but we keep it in the model as, generally, systems can (and do) have direct input‐to‐output dependencies.

Now that we have considered all the possible connections in our system, let’s try to model it mathematically. First, we need representations for the variables in our system. We have the previous state of the model, x(t-1), the input, u(t), the current state of the model, x(t), and the output, y(t).
We also need a notation to represent the relationship between every two variables in the system. Let’s denote the effect of the previous state on the current one by a matrix A, the effect of the input on the current state by a matrix B, the effect of the state on the output by a matrix C, and the direct effect of the input on the output by the matrix D.

From the input u(t), we need to compute two variables:
1. The new state x(t), which considers the effect of the previous state x(t-1) and the input u(t).
2. The output y(t), which considers the effect of the new state x(t) and the direct effect of the input u(t).
Consequently, we can derive the equations for the two variables:
1. The equation for the state x(t), which evolves under the influence of its current value and the input u(t):

x′(t) = A·x(t) + B·u(t)
2. The equation for the output y(t), which combines the state’s effect and the input’s direct effect:

y(t) = C·x(t) + D·u(t)
These two equations form our system’s state space representation (SSR). The SSR allows us to study the system’s stability by analyzing the effects of inputs on the system’s state variables and output.
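To make this concrete, here is a minimal NumPy sketch of the two SSR equations applied step by step. The matrices and the input signal are arbitrary illustrative values (not derived from a real car model), and for simplicity we use a discrete-time version of the update, which we’ll derive properly in the discretization section below.

```python
import numpy as np

# Illustrative state-space matrices for a 2-dimensional state (arbitrary values).
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])   # effect of the previous state on the current state
B = np.array([[0.5],
              [1.0]])        # effect of the input on the current state
C = np.array([[1.0, 0.0]])   # effect of the state on the output
D = np.array([[0.0]])        # direct feedthrough (zero in the car example)

x = np.zeros((2, 1))           # initial state
inputs = [1.0, 1.0, 0.0, 0.5]  # gas-pedal signal sampled at discrete steps

for u in inputs:
    u = np.array([[u]])
    x = A @ x + B @ u          # state equation (discrete form)
    y = C @ x + D @ u          # output equation
    print(y.item())
```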
We can model probabilistic dependencies between state variables and the inputs by introducing noise terms into the dynamics and observation equations. These stochastic extensions enable us to account for uncertainties in the system and its environment, providing a foundation for modeling and controlling the system’s behavior in real-world scenarios.
State space models for natural language processing
State Space Models (SSMs), long established in time series analysis, have been utilized as trainable sequence models for decades. Around 2020, their ability to efficiently handle long sequences spurred significant progress in adapting them for natural language processing (NLP).
The exploration of SSMs as trainable sequence models was gradual through multiple contributions that laid the foundation for introducing SSMs in deep learning models as “State Space Layers” (SSLs). In the following sections, we’ll explore key contributions that led to the use of SSMs as NLP models.
Applying SSMs to natural language processing reframes the input as a token, the state as the contextual representation, and the output as the predicted next token.
HiPPO: recurrent memory with optimal polynomial projections
The primary challenge sequence models face is capturing dependencies between two inputs that are far apart in a long sequence.
Let’s say we have a paragraph where the last sentence references something mentioned in the first sentence:

The word ‘Sushi’ in the first sentence is referenced in the last sentence, with a large number of words in between. Thus, understanding the phrase “that name” in the last sentence requires the first sentence for context.
Historically, sequence models, such as traditional RNNs, GRUs, and LSTMs, struggled to retain such long-range dependencies due to problems like vanishing or exploding gradients. The gating mechanisms these algorithms rely on regulate information flow by selectively retaining important features and discarding irrelevant ones, which mitigates issues like short-term memory loss.
However, these mechanisms are insufficient for capturing long-range dependencies because they struggle to preserve information over extended sequences. This is due to capacity constraints, a tendency to prioritize short-term patterns during training, and cumulative errors that degrade information over long sequences. While transformers address many of these issues through their self-attention mechanism, the quadratic complexity of attention makes them computationally inefficient for long sequences.
Albert Gu and colleagues at Stanford attempted to solve this problem by introducing HiPPO (short for “High-order Polynomial Projection Operators”). This mathematical framework compresses historical information into a fixed-size representation. Unlike the hidden state of an LSTM or GRU, which is also a fixed-size representation but primarily optimized for short-term memory retention, HiPPO is explicitly designed to capture the entire processed sequence, enabling sequence models to process and utilize long-range dependencies efficiently.
HiPPO works by constructing a set of polynomial bases that are mathematically orthogonal with respect to a specific weighting function. The weighting function w(t) weighs the importance of historical information using one of two variants:
1. Transform HiPPO Matrix Variations: Transform matrices prioritize the latest inputs and change the system’s response continuously with time. The importance of information stored in the sequence history decays over time.
2. Stationary HiPPO Matrix Variations: Stationary matrices are time-invariant and consider all past data with consistent importance. The rate of natural decay of information remains consistent over time, providing a balance between retaining historical information and responding to new inputs.
Gu and colleagues applied the two variants to three different polynomial families, referred to as Leg, Lag, and Cheb. The three families differ in how much information they retain, which is determined by the weighting function w(t) associated with each set of polynomials and their orthogonality properties:
1. HiPPO-Leg is based on the Legendre polynomials. It gives uniform weighting for all the information in the sequence. Thus, the weighting function w(t) = 1. As the sequence length becomes larger, the older parts of the sequence are compressed into a fixed-size representation.
2. HiPPO-Lag is based on the Laguerre polynomials. There is an exponential decay of information over time.
3. HiPPO-Cheb is based on the Chebyshev polynomials. It creates a non-uniform distribution that prioritizes the latest and oldest information.
The storage and prioritization of the sequence’s historical data is due to the mathematical properties of these polynomials. The appendix of the HiPPO paper contains all the equations and mathematical proofs.
The HiPPO matrix is obtained by deriving differential operators that project the input signal onto the specified polynomial basis in real-time. The operators ensure the orthogonality of the states while preserving the defined weighting function. The entries of the resulting matrix are defined by an integral of the form:

A_ij = ∫ ϕ′_i(t) · ϕ_j(t) · w(t) dt

Here, the ϕ(t) are the basis functions of the chosen family of orthogonal polynomials (i.e., Legendre, Laguerre, or Chebyshev), ϕ′_i(t) is the derivative of the i-th basis function with respect to time t, and w(t) is the weighting function that defines the importance of information over time. i is the index of the current state or basis function being updated, and j is the index of the previous state or basis function contributing to the update. The integral computes the contribution of the j-th basis function to the update of the i-th state, weighted by w(t).
This mechanism allows for efficiently updating the model’s hidden state, minimizing the loss of long-range dependencies. Thus, the HiPPO matrix can be used to control the update of a model’s context or hidden state.
This sounds familiar, right? In the previous section, we saw that for text data, the state corresponds to the context of the sequence, and the matrix A governs how that context is updated. Just like in RNNs and LSTMs, we can use this context (or hidden state) to predict the next word. Since its structure allows it to handle long- and short-range dependencies, the HiPPO matrix acts as a template for the matrix A.
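For readers who want to see what such a matrix looks like, here is a small sketch that constructs the HiPPO-LegS matrix from its closed-form entries, following the formulation popularized by the HiPPO and S4 line of work. Sign and normalization conventions vary between implementations, so treat this as illustrative rather than canonical.

```python
import numpy as np

def make_hippo_legs(N: int) -> np.ndarray:
    """Construct the N x N HiPPO-LegS matrix (up to sign convention)."""
    p = np.sqrt(1 + 2 * np.arange(N))       # sqrt(2n + 1) for n = 0 .. N-1
    A = p[:, None] * p[None, :]             # outer product sqrt(2i+1) * sqrt(2j+1)
    A = np.tril(A) - np.diag(np.arange(N))  # keep the lower triangle; diagonal becomes n + 1
    return -A                               # negated so the continuous-time system is stable

print(make_hippo_legs(4))
```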
Combining recurrent, convolutional, and continuous-time models with linear state-space layers
HiPPO’s inventors collaborated with other Stanford researchers to develop the Linear State Space Layer (LSSL), which builds on the HiPPO framework. This model makes significant strides in applying SSMs to sequence modeling tasks.
Their 2021 paper Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers aims to combine the best and most efficient properties of all the existing sequence modeling algorithms.
According to the authors, an ideal sequence modeling algorithm would have the following capabilities:
1. Parallelizable training, as is possible with Convolutional Neural Networks (CNNs). This saves computational resources and enables a faster training process.
2. Stateful inference, as provided by Recurrent Neural Networks (RNNs). This allows context to be used as a factor while deciding on the output.
3. Time-scale adaptation, as in Neural Differential Equations (NDEs). This enables the sequence model to adapt to various lengths of input sequences.
In addition to these properties, the model should also be able to handle long-range dependencies in a computationally efficient manner.
Motivated by these goals, the authors explored using State Space Models (SSMs) to develop a computationally efficient and generalizable sequence model suitable for long sequences.
Let’s explore how they did that:
As we learned above, the SSR equations represent a dynamic system with a continuously changing state. To apply SSMs to NLP, we need to adapt these continuous-time models to operate on discrete input sequences. Rather than continuous signals, we’ll now feed strings of individual tokens to the model one by one.
Discretization
We can discretize the continuous SSR equations using numerical methods.
To understand this process, we will return to the example of the continuously moving car. The car’s speed is a continuous signal. To study the variation in the car’s speed, we need to measure it at all times. However, it’s impractical to record every infinitesimal change in speed. Instead, we take measurements at regular intervals—for example, every 30 seconds.
By recording the car’s speed at these specific moments, we convert the continuous speed profile into a series of discrete data points. This process of sampling the continuous signal at regular intervals is called “discretization.” The interval of time we are using to measure the speed is called the time scale Δt, also known as “step size” or “discretization parameter.”

Similar to discretizing car speed, to adapt SSMs for natural language processing, we start with continuous-time equations that describe how a system evolves. We discretize the equations, converting them into a form that updates at each discrete time step.
The choice of Δt is critical: if it is too large, we risk losing important details of the state dynamics (undersampling):

If Δt is too small, the system might become inefficient or numerically unstable due to excessive computations (oversampling):

In Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers, the authors explored several methods for discretizing state-space models to adapt them for sequence modeling tasks. They ultimately selected the Generalized Bilinear Transform (GBT), which balances the accuracy of the discrete approximation against numerical stability. The GBT allows the discrete state-space model to approximate the continuous dynamics while remaining robust in numerical computations.
The discrete state equation under the GBT is given by:

x(t + Δt) = (I − α·Δt·A)⁻¹ (I + (1 − α)·Δt·A) x(t) + Δt·(I − α·Δt·A)⁻¹ B·u(t)

Here, x is the state, Δt is the time step, A is the matrix that represents how the previous state influences the current state, B is the matrix that represents the effect of the input on the state, I is the identity matrix, and α is a parameter that selects the particular discretization method.
A critical decision when applying the Generalized Bilinear Transform is the choice of the parameter α, which controls the balance between preserving the characteristics of the continuous-time system and ensuring stability in the discrete domain. The authors selected α = 0.5 as it balances accuracy and numerical stability. The resulting state equation (the bilinear method) is given by:

x(t + Δt) = (I − (Δt/2)·A)⁻¹ (I + (Δt/2)·A) x(t) + Δt·(I − (Δt/2)·A)⁻¹ B·u(t)
The bilinear transform is then applied to the initialized continuous-time matrices A and B, discretizing them into their discrete-time counterparts Ā and B̄, respectively.
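As a rough illustration, the generalized bilinear transform described above fits in a few lines of NumPy. The matrices below are arbitrary placeholders; alpha = 0.5 corresponds to the bilinear method chosen by the authors, while alpha = 0 and alpha = 1 would give the forward and backward Euler methods.

```python
import numpy as np

def discretize_gbt(A, B, dt, alpha=0.5):
    """Discretize continuous-time (A, B) with the generalized bilinear transform."""
    N = A.shape[0]
    I = np.eye(N)
    inv = np.linalg.inv(I - alpha * dt * A)    # (I - alpha*dt*A)^(-1)
    A_bar = inv @ (I + (1 - alpha) * dt * A)   # discrete state matrix
    B_bar = dt * (inv @ B)                     # discrete input matrix
    return A_bar, B_bar

# Example: discretize a small stable system with a step size of 0.1.
A = np.array([[-1.0, 0.5], [0.0, -2.0]])
B = np.array([[1.0], [0.5]])
A_bar, B_bar = discretize_gbt(A, B, dt=0.1)
```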
Now that we have a discretized version of the SSR equations, we can apply them to natural language generation tasks where:
1. u(t) is the input token we feed into the model.
2. x(t) is the context, which is the representation of the sequence’s history thus far.
3. y(t) is the output, the predicted next token.
Thus, we have a representation of SSMs that can handle tokens as input.

The three pillars of SSMs as sequence models
Now that we can use SSMs for NLP tasks, let’s see how they measure up with respect to the other available sequencing algorithms by circling back to the goals the authors stated at the beginning of Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers.
Parallelizable training
Parallelizable training would save a considerable amount of computational resources and time. Two widely used sequencing architectures are inherently parallelizable during training:
1. Convolutional Neural Networks (CNNs) are inherently parallelizable because the convolution operation can be applied simultaneously across all positions in the input sequence. In sequence modeling, CNNs process the entire input in parallel by applying convolutional filters over the sequence, allowing for efficient computation during training.
2. Transformers achieve parallelism through the self-attention mechanism, which simultaneously computes attention weights between all pairs of tokens in the sequence. This is possible because the computations involve matrix operations that can be parallelized, allowing the model to process entire sequences at once.
Efficiently distributing the computational workload is crucial for sequence algorithms, especially when training on large datasets. To address this challenge, the authors introduced a convolutional representation of SSMs, which allows these models to process sequences in parallel, similar to CNNs and Transformers.
The authors’ idea is to express the SSM as a convolution operation with a specific kernel k derived from the state-space parameters, enabling the model to compute outputs over long sequences efficiently.
To derive the SSR equations as a convolution operation, they assume the SSM model to be time-invariant. This means the matrices A, B, C, and D do not vary with time, the matrix A is stable (which is already achieved by adopting the HiPPO matrix for A that allows a numerically stable update of the context), and the initial state x(0) is 0.
Using the SSR equations mentioned earlier (state equation that derives x(t) and output equation that derives y(t)), the kernel k can be derived in two steps:
1. Solving for the state, we start with the state equation from the SSR equations, with the initial state x₀ = 0. Unrolling the recurrence gives:

x_n = Σ_{k=0}^{n−1} Ā^{n−1−k} B̄ u_k
We derived the state x_n, which represents the system’s state at time step n, based on the contributions of past inputs. Similarly, u_k denotes the input to the system at a specific time step k within the sequence. The number of time steps n (i.e., the number of times we sample using Δt) depends on the length of the input sequence, as the state x_n is influenced by all preceding inputs up to time step n−1.
2. Substituting x_n in the SSR output equation with the state derived in step 1 gives:

y_n = C (Σ_{k=0}^{n−1} Ā^{n−1−k} B̄ u_k) + D u_n
We can simplify this equation by combining the state-space matrices into a kernel k whose m-th element is:

k_m = C Ā^m B̄
Here, m is the index for summing over past inputs. The result is the following equation for the output at step n:

y_n = Σ_{m=0}^{n−1} k_m u_{n−1−m} + D u_n
Thus, we arrive at the convolutional representation of the SSM: the terms multiplying the past inputs form the kernel k, and we obtain the outputs by sliding this kernel across the input sequence.
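The following minimal sketch builds the kernel k_m = C Ā^m B̄ for a toy single-input, single-output system and applies it as a causal convolution, mirroring the equations above. The matrices are illustrative placeholders, and the naive double loop stands in for the FFT-based convolution used in practice.

```python
import numpy as np

def ssm_kernel(A_bar, B_bar, C, length):
    """k_m = C @ A_bar^m @ B_bar for m = 0 .. length-1 (SISO case)."""
    k = np.zeros(length)
    A_power = np.eye(A_bar.shape[0])
    for m in range(length):
        k[m] = (C @ A_power @ B_bar).item()
        A_power = A_bar @ A_power
    return k

def ssm_conv(u, k, D=0.0):
    """y_n = sum_{m=0}^{n-1} k_m * u_{n-1-m} + D * u_n (naive O(L^2) loop)."""
    y = np.zeros_like(u)
    for n in range(len(u)):
        for m in range(n):
            y[n] += k[m] * u[n - 1 - m]
        y[n] += D * u[n]
    return y

# Toy example with arbitrary discretized matrices.
A_bar = np.array([[0.9, 0.1], [0.0, 0.8]])
B_bar = np.array([[0.5], [1.0]])
C = np.array([[1.0, 0.0]])
u = np.array([1.0, 0.0, 0.0, 0.5, 1.0])
y = ssm_conv(u, ssm_kernel(A_bar, B_bar, C, len(u)))
```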
Stateful inference
Stateful inference refers to a sequence model’s ability to create, maintain, and utilize a “state,” which includes all the relevant context needed for further computations. This ability is desirable because it eliminates the computational inefficiency of understanding the context whenever a new input token is present.
Transformers capture long-range dependencies and context through the self-attention mechanism. However, recomputing the attention weights and value vectors every time we have a new input token is computationally expensive. We can cache the values of key and value vectors to avoid some recomputation, which makes it slightly more efficient. Still, it does not solve the problem of transformers scaling quadratically.
RNNs achieve stateful inference through a hidden state that is only updated and not recomputed for every input token. However, RNNs struggle to retain information from earlier tokens in long sequences. This limitation arises because, during backpropagation, gradients associated with long-range dependencies diminish exponentially as they are propagated through many layers (or time steps), a phenomenon known as the vanishing gradient problem. As a result, RNNs cannot effectively model long-range dependencies between tokens.
Thanks to their state equation, SSMs achieve stateful inference. They inherently maintain a state containing the sequence’s context, making them more computationally efficient than transformer-based models.
To handle long-range dependencies, the authors of Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers use the HiPPO-LegS (Stationary form of HiPPO-Leg) formulation to parameterize A.
Time-scale adaptation
Time-scale adaptation refers to a sequence model’s ability to capture dependencies for the input token in different parts of the input sequence. In technical terms, this means the context can retain dependencies that occur over different temporal distances within the same sequence. Time-scale adaptation enables effective capturing of both short-term (immediate) and long-term (distant) relationships between elements in the data.
A model’s context representation is crucial for its ability to capture the internal dependencies within a sequence. In an SSM, the context lives in the state, and the matrix A determines how that state is updated. Thus, an SSM’s ability to update the state based on the new input through the state equation allows the model to adapt to the contextual dependencies within a sequence, handling both long- and short-range dependencies.
Linear state space layers (LSSLs)
So far, we’ve seen that State Space Models are efficient sequence models. In their paper Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers, Gu and colleagues introduced the Linear State Space Layer (LSSL) utilizing both the discretized recurrent and convolutional forms of State Space Representation equations. This layer is integrated into deep learning architectures to introduce efficient handling of long-range dependencies and structured sequence representations.
Like RNNs, SSMs are recurrent. They update the context by combining the previous state with the new input. This recurrent form is slow to train because we need to wait for the previous state to be available before computing the next one. To address this problem, the authors devised the convolutional representation of the SSM equations that we discussed in the previous sections.
While the convolutional representation of SSMs enables training parallelization, it is not without its own problems. The key issue is the fixed size of the kernel. The kernel used to process the input sequence is determined by the model parameters (the matrices A, B, and C) and the sequence length, as we saw in the kernel derivation. However, natural language sequences vary in length, so the kernel would have to be recomputed during inference for each input sequence, which is inefficient.
Although recurrent representations are inefficient to train, they can handle varying sequence lengths. Thus, to have a computationally efficient model, we seem to need the properties of both the convolutional and recurrent representations. Gu and colleagues devised a “best of both worlds” approach, using the convolutional representation during training and the recurrent representation during inference.
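To illustrate the “best of both worlds” idea, the following self-contained sketch (with arbitrary toy values) computes the output of a small SISO SSM twice, once with the convolutional form used during training and once with the recurrent form used during inference, and checks that both give the same result.

```python
import numpy as np

A_bar = np.array([[0.9, 0.1], [0.0, 0.8]])
B_bar = np.array([[0.5], [1.0]])
C = np.array([[1.0, 0.0]])
D = 0.2
u = np.array([1.0, 0.0, 0.5, 1.0, 0.0])
L = len(u)

# Convolutional form (parallelizable): y_n = sum_m C A^m B u_{n-1-m} + D u_n
k = np.array([(C @ np.linalg.matrix_power(A_bar, m) @ B_bar).item() for m in range(L)])
y_conv = np.array([sum(k[m] * u[n - 1 - m] for m in range(n)) + D * u[n] for n in range(L)])

# Recurrent form (stateful): x_n = A x_{n-1} + B u_{n-1}, then y_n = C x_n + D u_n
x = np.zeros((2, 1))
y_rec = np.zeros(L)
for n in range(L):
    y_rec[n] = (C @ x).item() + D * u[n]
    x = A_bar @ x + B_bar * u[n]

print(np.allclose(y_conv, y_rec))  # True: both forms produce the same outputs
```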

In their paper, Gu and collaborators describe the LSSL architecture as a “deep neural network that involves stacking LSSL layers connected with normalization layers and residual connections.” Similar to the attention layers in the transformer architecture, each LSSL layer is preceded by a normalization layer and followed by a GeLU activation function. Then, through a residual connection, the output is added to the normalized output of a position-wise feedforward layer.

Efficiently modeling long sequences with structured state spaces
The LSSL model performed impressively well on sequence data but was not widely adopted due to computational complexities and memory bottlenecks.

In the paper Efficiently Modeling Long Sequences with Structured State Spaces, Gu, together with close collaborators Karan Goel and Christopher Ré, advanced the LSSL to reduce the computational complexity and improve the accuracy of the training process.
Improvements on the state matrix A
In the previous section, we explored how the original LSSL relied on a fixed, predefined form of the HiPPO matrix to serve as the state matrix A. While this representation was successful in compressing information, it was computationally inefficient due to the full (dense) matrix representation of A. Gu, Goel, and Ré described this implementation as “infeasible to use in practice because of prohibitive computation and memory requirements induced by the state representation.”
In the LSSL, the state is multiplied by the matrix A to produce the updated version of the state. The most computationally efficient form of A for this multiplication would be a diagonal matrix. Unfortunately, the HiPPO matrix cannot simply be diagonalized in practice: although it is diagonalizable in principle, the required change of basis is numerically ill-conditioned.
However, the authors were able to dissect the matrix into a diagonal plus low-rank decomposition (DPLR). The diagonal matrix has nonzero entries only on the main diagonal, which makes the multiplication process more efficient by requiring only a single multiplication per vector element. The low-rank matrix can be represented as the product of two much smaller matrices. Because of this factorization, the operations needed to multiply by the vector are greatly reduced compared to a full-rank matrix of the same size.
The original LSSL architecture required O(N²L) operations, where N is the state dimension and L is the sequence length. After transforming the matrix A into its diagonal plus low-rank (DPLR) form, the computational complexity of both the recurrent and convolutional forms was reduced:
1. For the recurrent form, the DPLR parametrization requires only O(NL) operations for the matrix-vector multiplications.
2. For the convolutional form, the convolutional kernel was reduced to require only O(N log L + L log L) operations. This was achieved by changing the technique used to derive the kernel, which included using the inverse Fast Fourier Transform (iFFT) and applying the Woodbury identity to reduce the low-rank term of matrix A.
This is a considerable leap in computational efficiency, significantly reducing the scaling with sequence length and bringing SSMs closer to linear time complexity, in contrast to the quadratic scaling of transformers.
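The following sketch illustrates why the DPLR structure makes the state update cheaper: multiplying by a diagonal plus low-rank matrix only requires an elementwise product and two thin matrix-vector products, instead of a full N × N matrix-vector product. The matrices here are random placeholders rather than the actual S4 parametrization.

```python
import numpy as np

N, r = 6, 1                            # state size and rank of the low-rank correction
rng = np.random.default_rng(0)

Lambda = rng.standard_normal(N)        # diagonal part, stored as a vector
P = rng.standard_normal((N, r))        # low-rank factors: A = diag(Lambda) + P @ Q.T
Q = rng.standard_normal((N, r))
x = rng.standard_normal(N)

# Dense multiplication: O(N^2) operations.
A_dense = np.diag(Lambda) + P @ Q.T
y_dense = A_dense @ x

# DPLR multiplication: O(N * r) operations (elementwise product plus two thin matvecs).
y_dplr = Lambda * x + P @ (Q.T @ x)

print(np.allclose(y_dense, y_dplr))    # True
```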
Improvements in the training implementation
After tackling the LSSL’s computational complexity, the authors made another significant improvement: making the matrix A (partially) learnable. In the LSSL, the matrix was fixed and not updated during the training process; instead, the matrices B and C carried the learnable parameters of the SSM blocks.
Keeping the matrix A fixed ensures computational efficiency, but it limits the model’s ability to capture complex dynamics and underlying patterns in the sequence. A fully learnable matrix A offers the flexibility to adapt to arbitrary dynamics. However, it comes with trade-offs: more parameters to optimize, slower training, and higher computational costs during inference.
To balance these competing demands, the modified LSSL – dubbed S4 – adopts a partially learnable A. By maintaining the DPLR structure of A, the model retains computational efficiency, while the introduction of learnable parameters enhances its ability to capture richer, domain-specific behaviors. By introducing learnable parameters into A, a model can adjust the state dynamics during training and update sequence-specific internal representations in the state.
Additionally, Efficiently Modeling Long Sequences with Structured State Spaces introduces techniques for implementing bidirectional state-space models. These models can process sequences in both the forward and backward directions, capturing dependencies from past and future contexts.
Simplified state space layers for sequence modeling
In Simplified State Space Layers for Sequence Modeling, Jimmy Smith, Andrew Warrington, and Scott Linderman proposed multiple improvements to the S4 architecture to enhance performance while maintaining the same computational complexity.
While the improvements of S4 over the original LSSL mainly focused on reducing the model’s computational complexity, S5 aimed to simplify the architecture, making it more efficient and easier to implement while maintaining or improving performance.
Using parallel associative scan
Parallel scan, also known as parallel associative scan, is an algorithm that computes the cumulative result of an associative operation (in this case, the composition of linear state updates) for every position in a sequence. Because the operation is associative, partial results can be combined in a tree-like fashion in parallel rather than strictly one element at a time.
Using a parallel associative scan, Smith and colleagues were able to parallelize the training process of recurrent SSMs, removing the need for the convolutional representation.
Thus, the S5 layer operates purely recurrently in the time domain instead of switching to a convolutional, frequency-domain formulation. This is an important improvement because it allows the per-layer time complexity to be O(N log L) instead of O(NL), leveraging parallel computation over the sequence length while reducing the memory overhead.
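As a sketch of the idea, the linear recurrence x_n = Ā·x_{n−1} + B̄·u_{n−1} can be expressed with an associative binary operator acting on pairs (Ā, B̄·u). Because the operator is associative, a parallel scan primitive (for example, jax.lax.associative_scan) can combine the pairs tree-style in O(log L) sequential steps; the loop below is only a serial reference implementation using arbitrary toy values.

```python
import numpy as np

def combine(e1, e2):
    """Associative operator for the linear recurrence x <- A x + b."""
    A1, b1 = e1
    A2, b2 = e2
    return A2 @ A1, A2 @ b1 + b2   # apply e1 first, then e2

# Toy SSM parameters and inputs.
A_bar = np.array([[0.9, 0.1], [0.0, 0.8]])
B_bar = np.array([[0.5], [1.0]])
u = [1.0, 0.0, 0.5, 1.0]

# One scan element per time step: (A_bar, B_bar * u_k).
elements = [(A_bar, B_bar * uk) for uk in u]

# Serial reference scan; a parallel implementation combines elements in a tree.
acc = elements[0]
states = [acc[1]]                  # x_1 = B u_0 (with x_0 = 0)
for e in elements[1:]:
    acc = combine(acc, e)
    states.append(acc[1])          # x_2, x_3, ...
```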
Allowing multi-input-multi-output
LSSL and S4 are Single-Input-Single-Output (SISO) models. Allowing Multi-Input-Multi-Output (MIMO) was computationally infeasible since the computations inside LSSL and S4 were designed under the assumption of a one-dimensional input at each time step. For example, adapting the convolutional representation to operate on matrices instead of vectors would have significantly increased the computational cost, making the approach impractical.
Smith and collaborators discretized the MIMO SSM equations instead of the SISO SSM equations. Using the same SSR equations, they extended the discretization process to handle m-dimensional inputs and n-dimensional outputs. Assuming the state has N dimensions, this change makes B an N x m matrix instead of N x 1, and C an n x N matrix instead of 1 x N.
S5’s support for MIMO allows it to handle multidimensional data, such as multivariate and multi-channel time series data, process multiple sequences simultaneously, and produce multiple outputs. This reduces computational overhead by allowing multiple sequences to be processed at the same time instead of having m copies of the SSM.
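As a quick shape check of the MIMO formulation (with arbitrary dimensions), the sketch below shows how B becomes an N × m matrix and C an n × N matrix, so that a single state update consumes a whole input vector and emits a whole output vector.

```python
import numpy as np

N, m, n = 8, 3, 2                     # state, input, and output dimensions (arbitrary)
rng = np.random.default_rng(0)

A_bar = rng.standard_normal((N, N))   # state transition
B_bar = rng.standard_normal((N, m))   # N x m instead of N x 1
C = rng.standard_normal((n, N))       # n x N instead of 1 x N
D = rng.standard_normal((n, m))       # direct feedthrough

x = np.zeros(N)
u_t = rng.standard_normal(m)          # one multivariate input per time step

x = A_bar @ x + B_bar @ u_t           # state update: shape (N,)
y_t = C @ x + D @ u_t                 # output:       shape (n,)
print(x.shape, y_t.shape)             # (8,) (2,)
```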
Diagonalized parametrization
As we discussed above, HiPPO-LegS cannot be stably diagonalized. However, the parallel scan approach requires a diagonal matrix A. Through experimentation, Smith and colleagues discovered that they could represent the HiPPO-LegS matrix as a normal plus low-rank (NPLR) matrix, where the normal component, referred to as HiPPO-N, can be diagonalized.
They showed that removing the low-rank term and initializing with the HiPPO-N matrix yields similar results, by proving that HiPPO-N and HiPPO-LegS produce the same dynamics. (A proof is given in the appendix of the paper.) In contrast, using the diagonal matrix from the DPLR approximation would have produced very different dynamics than the original structure.
Using a diagonalized version of the HiPPO-N matrix reduced the model’s computational complexity by removing the need to convert the HiPPO-LegS matrix into its DPLR approximation.
Similar to how using a structured parametrization for matrix A decreased the computational overhead, S5 uses a low-rank representation of matrices B and C, further reducing the number of parameters.

Conclusion and outlook
The evolution of State Space Models (SSMs) as sequence-to-sequence models has highlighted their growing importance in the NLP domain, particularly for tasks requiring the modeling of long-term dependencies. Innovations such as LSSL, S4, and S5 have advanced the field by enhancing computational efficiency, scalability, and expressiveness.
Despite the advancements made by the S5 model, it still lacks the ability to be context-aware. The S5 can efficiently train and infer in the time domain and retain information for long-range dependencies, but it does not explicitly filter or focus on specific parts of the sequence, as Transformers do with attention mechanisms.
Hence, a key next step is to incorporate a mechanism into SSMs that enables them to focus on the most relevant parts of the state rather than processing the entire state uniformly. This is what the Mamba model architecture addresses, which we’ll explore in the upcoming second part of the series.