What Comes After Transformers? An Exploration of AI’s Next Frontier
- Cluedo Tech
- Sep 12, 2024
- 7 min read
The Rise and Reign of Transformers
Since their introduction in 2017 by Vaswani et al., transformers have been the cornerstone of some of the most groundbreaking developments in artificial intelligence (AI). From language models like GPT-4 to vision applications like ViT (Vision Transformers), the architecture’s versatility and scalability have driven state-of-the-art results across domains. But as we push the boundaries of what’s possible in AI, we’re confronted with questions about its limitations and the need for alternatives that address scalability, efficiency, and specialized applications.
While the transformer architecture has proven highly effective, its widespread use comes with drawbacks: high computational costs, inefficient use of resources for certain tasks, and inherent challenges like hallucinations in language models. This post explores what might come next: potential successors to transformers, novel ideas in deep learning, and what they mean for the future of AI.

The Good, the Bad, and the Evolution
The Core of Transformers: Self-Attention and Parallelization
Transformers revolutionized AI by introducing the self-attention mechanism, allowing models to process sequences as a whole rather than step-by-step, as done in Recurrent Neural Networks (RNNs) and Long Short-Term Memory Networks (LSTMs). This parallelization made transformers highly scalable and effective for handling long-range dependencies, especially in natural language processing (NLP) and computer vision. However, the computational demand of this approach is immense, particularly because self-attention scales quadratically with the input sequence length.
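To make that quadratic cost concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention (the shapes, weights, and names are illustrative, not taken from any particular library). The (n, n) score matrix computed for n tokens is exactly the term that grows quadratically with sequence length.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Minimal single-head scaled dot-product self-attention.

    X: (n, d) sequence of n token embeddings of width d.
    The (n, n) score matrix is the source of the quadratic
    memory and compute cost in sequence length n.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n, n) pairwise token interactions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                         # weighted mix of value vectors

# Toy usage: 8 tokens, embedding width 16.
rng = np.random.default_rng(0)
n, d = 8, 16
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (8, 16)
```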
From BERT to GPT-4: The Evolution of Transformer-Based Models
Since the original transformer, numerous models have extended its principles:
- BERT (Bidirectional Encoder Representations from Transformers): Trained to model context in both directions, making it highly effective for tasks like question answering and sentiment analysis.
- GPT series (Generative Pre-trained Transformers): Combines large-scale generative pre-training with task-specific fine-tuning and, later, few-shot prompting, with GPT-3 and GPT-4 pushing the boundaries through massive datasets and parameter counts.
Despite these advancements, the transformer architecture remains fundamentally the same, relying heavily on self-attention and dense layers, which are computationally expensive. This has prompted the exploration of new architectures and approaches that could lead to more efficient and specialized models.
What’s on the Horizon?
While transformers have dominated the AI landscape, researchers and engineers have been exploring various alternatives that promise to overcome some of the architecture’s limitations.
Combining Sequential Models and Transformers – A Hybrid Approach
One promising area of research is integrating classical sequence models like RNNs and LSTMs with transformers to achieve the best of both worlds:
- xLSTM (Extended LSTM): Enhances traditional LSTMs with exponential gating and revised memory structures, allowing training to be parallelized more effectively. Early results are promising, but scaling these models to larger datasets remains a challenge.
- RWKV (Receptance Weighted Key Value): These models aim to combine the parallel training advantages of transformers with the efficient inference of RNNs. By replacing full token-to-token attention with a channel-wise, recurrent formulation, RWKV achieves linear time complexity, making it scalable to long sequences without the heavy computational burden of transformers (a simplified sketch of this kind of recurrent update follows below).
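To see why a recurrent key-value formulation sidesteps the quadratic cost, the toy sketch below (a deliberate simplification, not the published RWKV equations) carries a small running state per channel instead of an (n, n) attention matrix, so each new token is processed in constant time and memory.

```python
import numpy as np

def recurrent_kv_stream(keys, values, decay=0.9):
    """Toy linear-time token mixing in the spirit of RWKV.

    Instead of an (n, n) attention matrix, we keep running
    numerator/denominator states, so each step touches only the
    current token and a fixed-size state. This is an illustrative
    simplification, not the exact RWKV formulation.
    """
    num = np.zeros_like(values[0])   # running sum of exp(k) * v
    den = np.zeros_like(keys[0])     # running sum of exp(k)
    outputs = []
    for k_t, v_t in zip(keys, values):
        num = decay * num + np.exp(k_t) * v_t
        den = decay * den + np.exp(k_t)
        outputs.append(num / (den + 1e-8))
    return np.stack(outputs)

rng = np.random.default_rng(0)
n, d = 1024, 64                      # long sequence, constant memory per step
keys = rng.normal(size=(n, d))
values = rng.normal(size=(n, d))
print(recurrent_kv_stream(keys, values).shape)  # (1024, 64)
```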
These hybrid models are particularly promising for tasks where inference efficiency is critical, such as real-time applications in finance or IoT devices.
State-Space Models – Blending Classical Techniques with Modern AI
State-space models (SSMs) draw from control theory and offer a different approach to sequence modeling by merging elements of convolutional and recurrent models. The most notable innovation in this space is the Structured State Space Sequence (S4) model, which efficiently handles long-range dependencies through low-rank corrections and hardware-aware optimizations.
SSMs have shown potential in small language models, but their real promise lies in applications requiring robustness and stability, such as autonomous systems and industrial control. The next wave of AI could see SSMs becoming key components in systems where reliability and low-latency processing are paramount.
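As rough intuition for how an SSM consumes a sequence, the sketch below runs a tiny discretized linear state-space recurrence, x_t = A x_{t-1} + B u_t with readout y_t = C x_t, in NumPy. Real S4-style models add a carefully structured A matrix and a convolutional view for fast training, all omitted here; the matrices and sizes are purely illustrative.

```python
import numpy as np

def run_ssm(A, B, C, inputs):
    """Toy linear state-space model applied to a 1-D input sequence.

    x_t = A x_{t-1} + B u_t   (hidden state update)
    y_t = C x_t               (readout)

    Real S4-style models use carefully structured A matrices and a
    convolutional training view; this sketch only shows the recurrent
    view that makes inference cheap and stateful.
    """
    state = np.zeros(A.shape[0])
    outputs = []
    for u_t in inputs:
        state = A @ state + B * u_t
        outputs.append(C @ state)
    return np.array(outputs)

# Illustrative parameters: 4-dimensional state, scalar input/output.
rng = np.random.default_rng(0)
A = 0.9 * np.eye(4) + 0.01 * rng.normal(size=(4, 4))  # near-stable dynamics
B = rng.normal(size=4)
C = rng.normal(size=4)
u = np.sin(np.linspace(0, 10, 50))                    # toy input signal
print(run_ssm(A, B, C, u).shape)  # (50,)
```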
MLP-Based Models – A Simpler but Effective Approach
Multi-Layer Perceptron (MLP) models like the MLP-Mixer offer a minimalist alternative, relying solely on MLP layers with no convolutions or attention mechanisms. Surprisingly, these models have achieved competitive performance on image classification tasks. Related attention-free designs, such as Hyena, use long convolutions and data-controlled gating as drop-in replacements for attention.
For businesses looking to deploy AI at scale, MLP-based models could be game-changers. Their simplicity allows for easier deployment in edge devices, where resource constraints make traditional transformers infeasible. These models could drive applications like real-time video processing in autonomous vehicles or lightweight AI assistants in smart devices.
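To show just how simple these models are, here is a minimal forward pass of one Mixer-style block in NumPy: one MLP mixes information across tokens and a second mixes across channels, each with a skip connection. Layer normalization and trained weights are simplified away, and all sizes are illustrative.

```python
import numpy as np

def mlp(x, W1, W2):
    """Two-layer MLP with a tanh-based GELU approximation."""
    h = x @ W1
    h = 0.5 * h * (1.0 + np.tanh(0.7978845608 * (h + 0.044715 * h**3)))
    return h @ W2

def mixer_block(X, token_W1, token_W2, chan_W1, chan_W2):
    """One simplified MLP-Mixer block (layer norm omitted for brevity).

    X: (tokens, channels). Token mixing operates per channel across
    tokens; channel mixing operates per token across channels.
    Both use skip connections.
    """
    X = X + mlp(X.T, token_W1, token_W2).T   # mix information across tokens
    X = X + mlp(X, chan_W1, chan_W2)         # mix information across channels
    return X

# Illustrative sizes: 16 image patches ("tokens"), 32 channels.
rng = np.random.default_rng(0)
t, c, hidden = 16, 32, 64
X = rng.normal(size=(t, c))
token_W1, token_W2 = 0.1 * rng.normal(size=(t, hidden)), 0.1 * rng.normal(size=(hidden, t))
chan_W1, chan_W2 = 0.1 * rng.normal(size=(c, hidden)), 0.1 * rng.normal(size=(hidden, c))
print(mixer_block(X, token_W1, token_W2, chan_W1, chan_W2).shape)  # (16, 32)
```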
Beyond Conventional Models
While the aforementioned approaches offer incremental improvements, some researchers are exploring more radical ideas that could redefine how we think about AI models altogether.
Capsule Networks – Reimagining Feature Representation
Proposed by Hinton et al., capsule networks represent features as vectors instead of scalars, capturing the relationships between parts of an object more effectively. In a facial recognition system, for example, a capsule might encode the pose of a nose and how it relates spatially to the mouth, making the representation more resilient to variations in pose or lighting.
Despite initial struggles with scalability, capsule networks hold promise for tasks where understanding the hierarchical structure of objects is essential, such as medical imaging or 3D modeling. As AI systems evolve, the ability to understand complex relationships between features could become critical for advanced robotics and AR/VR applications.
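The vector-valued idea is easiest to see in the "squash" nonlinearity used by capsule networks: it rescales each capsule vector so its length behaves like a presence probability while its direction carries pose-like information. A minimal NumPy version follows; the routing-by-agreement step that links capsules across layers is omitted.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Capsule 'squash' nonlinearity.

    Keeps the direction of each capsule vector but compresses its
    length into [0, 1), so the length can be read as the probability
    that the entity the capsule represents is present.
    """
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

# Toy usage: 10 capsules, each an 8-dimensional pose vector.
rng = np.random.default_rng(0)
capsules = rng.normal(size=(10, 8))
out = squash(capsules)
print(np.linalg.norm(out, axis=-1).round(2))  # lengths now lie in [0, 1)
```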
Spiking Neural Networks – Adding Temporality to AI
Spiking neural networks (SNNs) mimic the timing-based behavior of biological neurons, offering a temporal dimension that traditional neural networks lack. By firing spikes only when a certain threshold is reached, SNNs can process temporal patterns efficiently, making them ideal for applications like neuromorphic computing or real-time sensor processing in autonomous systems.
As industries increasingly rely on AI for real-time decision-making (e.g., self-driving cars or real-time market trading), SNNs could offer the low-latency and energy-efficient solutions needed to process streams of data quickly and accurately.
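The usual building block here is the leaky integrate-and-fire (LIF) neuron, which captures the "fire only when a threshold is crossed" behavior described above. The sketch below simulates one such neuron in NumPy; the leak factor, threshold, and input current are illustrative values rather than settings from any particular neuromorphic platform.

```python
import numpy as np

def lif_neuron(input_current, threshold=1.0, leak=0.95, reset=0.0):
    """Simulate a single leaky integrate-and-fire neuron.

    The membrane potential leaks toward zero each step, integrates the
    incoming current, and emits a spike (1) whenever it crosses the
    threshold, after which it is reset. Output is a sparse spike train.
    """
    potential = 0.0
    spikes = []
    for current in input_current:
        potential = leak * potential + current   # leaky integration
        if potential >= threshold:
            spikes.append(1)
            potential = reset                    # reset after firing
        else:
            spikes.append(0)
    return np.array(spikes)

# Toy input: weak noise with an occasional strong pulse.
rng = np.random.default_rng(0)
current = 0.1 * rng.random(100)
current[::20] += 0.9                             # periodic strong inputs
print(lif_neuron(current).sum(), "spikes out of", len(current), "steps")
```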
Neurosymbolic AI – Combining Logic and Learning
Neurosymbolic AI blends deep learning with symbolic reasoning, allowing systems to perform tasks that require both pattern recognition and logical inference. This hybrid approach is particularly valuable in areas like legal analysis, where the ability to understand nuanced language and apply logical rules is crucial.
For businesses, neurosymbolic AI could enable more interpretable and trustworthy systems. As regulatory environments tighten, especially in finance and healthcare, the need for AI that can justify its decisions and adhere to explicit rules will grow. Neurosymbolic systems could be the key to creating compliant and transparent AI solutions.
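As a toy illustration of the pattern rather than any specific neurosymbolic framework, the sketch below pairs a stand-in "neural" scoring function with explicit symbolic rules that can veto the prediction and always produce a readable justification. The rules, thresholds, and field names are hypothetical.

```python
def neural_risk_score(application: dict) -> float:
    """Stand-in for a learned model; in practice this would be a trained
    network's probability output. Hypothetical toy scoring for illustration."""
    return min(1.0, application["debt"] / max(application["income"], 1.0))

def symbolic_rules(application: dict) -> list[str]:
    """Explicit, human-readable rules applied on top of the learned score."""
    violations = []
    if application["age"] < 18:
        violations.append("applicant must be an adult")
    if application["income"] <= 0:
        violations.append("income must be documented and positive")
    return violations

def decide(application: dict) -> dict:
    """Neural score proposes, symbolic rules dispose: the rules can override
    the statistical prediction and always yield a readable justification."""
    score = neural_risk_score(application)
    violations = symbolic_rules(application)
    approved = (score < 0.5) and not violations
    return {"approved": approved, "risk_score": round(score, 2), "reasons": violations}

print(decide({"age": 30, "income": 50_000, "debt": 10_000}))
print(decide({"age": 17, "income": 50_000, "debt": 10_000}))
```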
Mathematical Inspirations – Tying AI to Fundamental Concepts
Some of the most innovative ideas in AI come from rethinking neural network architectures through the lens of mathematics. For instance, researchers have drawn parallels between residual networks (ResNets) and differential equations, leading to more efficient optimization techniques. These insights could pave the way for models that are not only more computationally efficient but also more interpretable, bridging the gap between machine learning and traditional scientific models.
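The ResNet/ODE connection is easy to state in code: a residual block computes x + f(x), which is one explicit Euler step of dx/dt = f(x) with step size 1. The sketch below (with an illustrative f and step sizes) shows that stacking residual updates amounts to integrating that differential equation more or less finely.

```python
import numpy as np

def f(x):
    """Stand-in for a residual branch; in a real ResNet this would be a small
    learned sub-network. Chosen here so dx/dt = f(x) has a known solution."""
    return -0.5 * x

def residual_stack(x, num_blocks, step=1.0):
    """Apply num_blocks residual updates x <- x + step * f(x).

    With step = 1 this is a plain ResNet-style stack; shrinking the step
    while adding blocks approaches the exact solution of dx/dt = f(x),
    which is the observation behind neural-ODE-style models.
    """
    for _ in range(num_blocks):
        x = x + step * f(x)                    # one explicit Euler step
    return x

x0 = np.array([1.0])
print(residual_stack(x0, num_blocks=4, step=1.0))     # coarse: 4 residual blocks
print(residual_stack(x0, num_blocks=400, step=0.01))  # fine: close to the exact ODE
print(np.exp(-0.5 * 4) * x0)                          # exact solution at t = 4
```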
Key Patterns in AI Innovation – What Works and Why?
Across the diverse landscape of emerging AI models, several patterns have emerged as consistently effective strategies:
- Multi-X Strategies: Techniques like multi-head attention or mixture-of-experts models show that repeating similar components in parallel configurations can boost performance and robustness.
- Higher-Order Layers: Richer layer designs, such as the Mish activation function or data-controlled gating, enable deeper, more nuanced processing of inputs (a minimal Mish implementation follows below).
- Decomposition and Simplification: Breaking complex operations into simpler components, as seen in large-kernel attention or Tucker decompositions, has led to more efficient architectures.
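As a concrete example of the "higher-order layers" item above, Mish is a smooth, self-gated activation defined as x · tanh(softplus(x)); a few lines of NumPy suffice to implement it. The ReLU comparison in the usage lines is only for illustration.

```python
import numpy as np

def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def mish(x):
    """Mish activation: x * tanh(softplus(x)).

    Smooth and non-monotonic, it lets small negative values pass through
    slightly instead of clamping them to zero as ReLU does."""
    return x * np.tanh(softplus(x))

x = np.linspace(-3, 3, 7)
print(np.round(mish(x), 3))          # smooth curve through zero
print(np.maximum(x, 0.0))            # ReLU on the same points, for comparison
```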
These strategies underscore the incremental yet impactful nature of most AI innovations. While transformers have dominated the AI landscape for years, the most groundbreaking future developments might come from recombining existing ideas or applying them in novel contexts.
Why Should We Care About the Future of AI Beyond Transformers?
The business implications of moving beyond transformers are profound. Here’s why exploring alternatives is crucial:
- Scalability and Cost Efficiency: As businesses deploy AI at scale, the high costs associated with transformer models become a limiting factor. Alternatives like state-space models or MLP-based architectures could offer more cost-effective solutions for tasks like personalized recommendations or real-time analytics.
- Specialized Applications: Transformers excel in general-purpose tasks, but they’re not the best fit for everything. For instance, neurosymbolic AI could be transformative in industries requiring both pattern recognition and logical reasoning, such as legal tech or automated contract analysis.
- Sustainability and Resource Optimization: The environmental impact of training massive transformer models is a growing concern. Simpler models like MLP-Mixers or spiking networks could lead to greener AI, helping businesses reduce their carbon footprint while still delivering high performance.
- Future-Proofing AI Investments: As regulatory scrutiny increases, especially around explainability and fairness, businesses need AI systems that can adapt to changing requirements. Models that prioritize interpretability, such as capsule networks or neurosymbolic approaches, are likely to be more resilient to future regulatory challenges.
- Unlocking New Use Cases: From AI-driven creativity to advanced robotics, the next generation of AI architectures could unlock use cases that are currently out of reach. Whether it’s creating intelligent assistants that can plan and reason or enabling AI to work alongside humans in manufacturing environments, the possibilities are vast.
Conclusion: The Path Forward in AI Architecture
The world of AI is at a pivotal moment. While transformers have brought us far, they are not the endgame. A recent survey by Johannes Schneider suggests that the real breakthroughs in AI might come from rethinking foundational assumptions, combining ideas from across disciplines, and daring to explore paths that go beyond incremental improvements.
For AI practitioners, businesses, and researchers, the future lies in embracing creativity, whether through hybrid models, new architectural paradigms, or the integration of symbolic logic with deep learning. By staying open to alternative approaches, we can push the boundaries of what AI can achieve, creating systems that are not only more efficient but also more aligned with human values and needs.
As AI becomes an increasingly integral part of our world, those who lead in adopting and developing these next-generation models will be the ones who define the future of technology, business, and society.
Cluedo Tech can help you with your AI strategy, discovery, development, and execution using the AWS AI Platform. Request a meeting.