Where does AI really fit in - and where doesn't it? An in-depth guide for tech decision-makers
06 | 2025 Jari Huilla, CTO & partner Kipinä
Artificial intelligence, in many forms, has been part of digital solutions for a long time. The pace of progress is now very fast, but have you stopped to think about why many models can still only genuinely process around 50 pages of text at a time?
CTO Jari Huilla gives those working on AI solutions a deeper insight into what happens at the core of language models and why exactly the transformer architecture inside them limits the size of the context a model can internalise at once.
In this article, we dive deep into the guts of language models through technical descriptions to get to the root of the challenges. Almost all organisations use AI at some level, but when building business-critical solutions around AI, a deeper understanding of the technology helps in recognising the models' limits and constraints.
If you're just skimming the article or want to stay closer to practice, you can jump straight to the tips at the end on how to take the constraints of language models into account when designing solutions. A focused reader will be rewarded with an interesting in-depth look at how language models really work under the hood.
(This article was conceived using AI, but written and illustrated by hand.)
To start with - yes, AI is more than just large language models
Various methods that can be classified as artificial intelligence have long been used to solve optimisation and other constrained problems in particular. The limitations of more traditional mathematical and logical models have been relatively easy to understand, because the problem had to be well analysed before the model to be applied was chosen and the models themselves were highly specialised.
For example, when planning production or traffic flows, mathematical modelling and linear optimisation of the problem force you to become familiar with the constraints of the modelling approach. For linear models suitable for this kind of scheduling, the questions to consider are (a toy sketch follows the list below):
what decision variables are involved in the problem? (e.g. how many people each charter bus should pick up)
how the quality of a solution is calculated (e.g. how many stops each charter bus has to make on average)
what constraints must the solution meet to be acceptable? (e.g. the bus cannot carry a negative number of passengers and the number of passengers must be an integer)
what data is needed to calculate the solution? (e.g. passengers registered at different stops, maximum duration of the bus route)
do we need an optimal solution, or is a solution that meets the constraints sufficient?
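To make the above concrete, here is a toy sketch of such a linear model using SciPy's linprog solver; the buses, stops and costs are invented, and the integer requirement from the list above is relaxed for simplicity:

```python
import numpy as np
from scipy.optimize import linprog

# Toy scenario: assign passengers waiting at three stops to two charter buses
# so that the total detour cost is minimised.
# Decision variables x[i, j] = passengers from stop j riding on bus i.
detour_cost = np.array([[1.0, 2.0, 3.0],
                        [3.0, 2.0, 1.0]])   # cost per passenger for bus i at stop j
passengers = np.array([12, 7, 20])          # registered passengers at each stop
capacity = np.array([25, 25])               # seats per bus

c = detour_cost.flatten()                   # objective: minimise total detour cost

# Constraint: every registered passenger at each stop must be picked up.
A_eq = np.zeros((3, 6))
for j in range(3):
    A_eq[j, j] = 1        # bus 0, stop j
    A_eq[j, 3 + j] = 1    # bus 1, stop j
b_eq = passengers

# Constraint: no bus may exceed its capacity.
A_ub = np.zeros((2, 6))
A_ub[0, 0:3] = 1
A_ub[1, 3:6] = 1
b_ub = capacity

# Passenger counts cannot be negative; a real model would also require integers.
result = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                 bounds=[(0, None)] * 6, method="highs")
print(result.x.reshape(2, 3))   # passengers from each stop on each bus
print("total detour cost:", result.fun)
```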
This makes it relatively clear from the modelling stage what constraints the solution will have. However, in machine learning models, and especially in deep learning models, the understanding of constraints becomes much more complex.
This is even more pronounced in large pre-trained models such as large language models, especially since they also involve unexpected "emergent" abilities and it is easy to ask a natural language model anything. So how can the limitations of large language models be addressed?
A peek under the hood - what are the underlying constraints of current large language models?
Today's large language models have been trained on huge amounts of data, and the methods for linking enterprise-specific data to them are well advanced and constantly evolving. Yet, for example, agent-based editing of a large code base may hit a wall the agent can no longer get past, or a manually prompted reasoning model may return an error when the context runs out of space. What is the root cause of the context size constraints, and what can be done about them?
A small content warning: this sneak peek contains maths, but only as much as is necessary! If you don't need such an in-depth look right now, you can take a peek at the more practical tips at the end of the article.
Neural networks
As a prelude to why the transformer architecture behind today's models was originally needed, a quick introduction to neural networks - or a refresher if you're already familiar with them. A neural network is a graph with nodes called neurons and arcs called synapses. Neurons are arranged in successive layers, and if there are additional (hidden) layers between the input and output layers, it is a deep neural network. The idea of a neural network is to mimic the structure of the biological brain, where electrical impulses travel through a network of nerve cells.
A small neural network with four layers. This network has three input neurons in the input layer and two output neurons in the output layer. In real life, even the smallest neural networks are typically on the order of a hundred neurons and the largest are several million.
In the brain, the connections between neurons change and develop in response to experiences, thinking and external stimuli. This is modelled in the training phase of neural networks by assigning a weight to each synapse in the network, which determines how strongly the activation of a neuron in the previous layer affects each neuron in the next layer.
In addition, each neuron has its own constant term (bias), which helps control how easily that neuron activates and thereby how much influence it has on the next layer.
All neurons in the previous layer affect each neuron in the next layer in proportion to the weight coefficient generated in training.
The activation of each neuron is therefore calculated as a weighted sum of the previous layer's activations:

a_j(1) = w_j,1 · a_1(0) + w_j,2 · a_2(0) + … + w_j,n · a_n(0) + b_j

To enhance the learning power of the network, the weighted sum is additionally run through a nonlinear activation function (such as a sigmoid), but that is not essential here. The sum above is more generally formulated by arranging the activations of the previous layer into a single vector and the weights into a matrix. The activations of the entire layer a(1) can then be represented as a matrix-vector product.
The matrix on the left contains the weights of all synapses leading from the previous layer to this layer. The first vector contains the activations of neurons in the previous layer and the second vector contains the constant terms.
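To make the notation concrete, here is a minimal NumPy sketch (with made-up layer sizes and random weights) of how the activations of one layer are computed as a matrix-vector product plus the bias terms, followed by a nonlinear activation:

```python
import numpy as np

# A minimal sketch: one layer's activations as a matrix-vector product.
rng = np.random.default_rng(0)

a_prev = rng.random(3)             # activations a(0) of the previous layer (3 neurons)
W = rng.standard_normal((4, 3))    # weights of all synapses leading to the next layer (4 neurons)
b = rng.standard_normal(4)         # constant term (bias) of each neuron in the next layer

z = W @ a_prev + b                 # all weighted sums of the layer in one operation
a_next = 1.0 / (1.0 + np.exp(-z))  # nonlinear activation, here a sigmoid

print(a_next)                      # activations a(1) of the next layer
```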
This notation will help us to understand a little later why the context window of a large language model has the current size constraints. Moreover - in case you hadn't thought of it - this is the reason why the graphics card manufacturer Nvidia has risen with a bang in the list of the world's most valuable companies: matrices and vectors are not only essential for artificial intelligence, but also for 3D graphics!
These types of neural networks are very useful for applications such as image recognition, but even large-scale neural networks of this type have not proven to be effective in practice for natural language processing and especially for language translation. This is because this type of network practically "forgets" what was mentioned at the beginning of the sentence when it reaches the end of the sentence (see vanishing-gradient problem). Various approaches to this problem were tried, but the real breakthrough came with the now legendary Attention Is All You Need paper in 2017.
Tokens, vectorisation and embeddings
Before getting to the heart of that famous paper, let's go over what large language models do in practice and how natural language is converted into vector form in the first place.
Large language models take text (prompts) as input, which is divided into tokens. Tokens are short pieces of text, such as syllables, punctuation marks and single letters, but for simplicity, tokens are presented here as if they were complete words.
The language model calculates, in effect, probabilities for the possible next words. Here, the probability is represented by the length of the bar preceding each word.
Based on the input it receives, the language model predicts the next word that, according to its training material, would be a likely continuation of the preceding text. The model then predicts the next word, and so on. If you have noticed, for example in ChatGPT or Copilot, that the answer appears on the screen in chunks, this is why.
A small amount of randomness has been added to language models to make the answers they produce more varied and natural. The amount of randomness can usually be adjusted through the model's API (typically a temperature parameter) when using the models.
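As an illustration, here is a minimal sketch of how this kind of sampling works in principle; the candidate words, their scores and the temperature value are all made up:

```python
import numpy as np

# A minimal sketch: choosing the next token from the model's probabilities,
# with a temperature parameter controlling the amount of randomness.
rng = np.random.default_rng(0)

candidates = ["mat", "sofa", "roof", "moon"]
logits = np.array([2.1, 1.3, 0.4, -1.0])       # raw scores a model might assign

def next_token_probs(logits, temperature=1.0):
    scaled = logits / temperature              # lower temperature -> sharper distribution
    exp = np.exp(scaled - scaled.max())        # softmax, shifted for numerical stability
    return exp / exp.sum()

probs = next_token_probs(logits, temperature=0.7)
print(dict(zip(candidates, probs.round(3))))
print("chosen:", rng.choice(candidates, p=probs))   # sample one continuation
```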
In order to do any computation with words at all, a vector representation is computed for them during the model training phase, which for example in the case of GPT-4 models has 12-16 thousand dimensions. This representation also tends to capture semantic meaning, which is why the vector is called an embedding vector. For example, the vector representations of the words "cat" and "dog" end up relatively close to each other in vector space.
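A minimal sketch of this idea, using tiny made-up 4-dimensional vectors instead of real embeddings: the cosine similarity of "cat" and "dog" comes out higher than that of "cat" and "car":

```python
import numpy as np

# Toy embedding vectors with invented values: semantically related words
# end up close to each other in the embedding space.
embeddings = {
    "cat": np.array([0.8, 0.1, 0.6, 0.0]),
    "dog": np.array([0.7, 0.2, 0.5, 0.1]),
    "car": np.array([0.0, 0.9, 0.1, 0.7]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("cat vs dog:", round(cosine_similarity(embeddings["cat"], embeddings["dog"]), 3))
print("cat vs car:", round(cosine_similarity(embeddings["cat"], embeddings["car"]), 3))
```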
But what is it in current models that makes a word like the Finnish "kuusi" be understood differently depending on the context, as either the number six or a spruce tree?
Transformer architecture and attention mechanisms
The real breakthrough in how large language models understand text, and especially in translating language, came when a suitable mechanism was found to let the whole context influence the vector representations of words. For each input word, an independent vector representation is first taken from the so-called embedding matrix generated during model training:
The vector representations taken from the embedding matrix do not yet contain any information about each other.
These individual input word vector representations are combined into a matrix, which is then passed through two types of layers in the transformer architecture:
Attention layers, where the words in the context also affect each other
Feed-forward layers, where words do not affect each other but pass through a virtually identical neural network
There can be an arbitrary number of these layers in a row, but what they have in common is that attention and feed-forward layers alternate.
It is in these attention layers that the language model's actual "understanding" of the words' context is created. This is the mechanism by which today's language models distinguish between the different meanings of a word like "kuusi", and which greatly improved the models' performance in translation tasks. If the context contains the word "kuusi" twice, once meaning a tree and once meaning a number, the two occurrences end up in different parts of the vector space in the attention layers, even though they start from identical embeddings!
In the attention layers, the other words in the context influence the interpretation of each word, in effect moving it to a more "correct" place in the vector space.
However, it is this attention mechanism that currently imposes the most severe constraints on the size of the context of the models. We will not go into the different types of attention in depth here, but attention models that genuinely take into account all words in context have a certain computational burden in common.
For all the words in the context and their locations in the context, so-called query and key vectors are calculated. Query vectors can be thought of as encoding questions about a word and its location - such as "are there adjectives describing this word in front of me?" and key vectors are the answers to these questions. Typically, these query and key vectors are not as multidimensional as the model itself, but can be, for example, 128-dimensional.
To find out, for each pair of words, how well each query vector (Q in the figure) corresponds to each key vector (K in the figure), the dot products of all query vectors with all key vectors are calculated. This means that the number of required scores grows quadratically with the size of the context window! So, for example, if the length of the input quadruples, the number of required scores increases 16-fold.
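A minimal sketch with random toy data of why the work grows this fast: the score matrix contains one dot product per token pair, so its size is the context length squared:

```python
import numpy as np

# The attention score matrix is context_length x context_length,
# so its size grows quadratically with the context.
rng = np.random.default_rng(0)

context_length = 8                        # number of tokens in the context
d_qk = 128                                # dimension of the query/key vectors

Q = rng.standard_normal((context_length, d_qk))   # one query vector per token
K = rng.standard_normal((context_length, d_qk))   # one key vector per token

scores = Q @ K.T / np.sqrt(d_qk)          # scaled dot products for every token pair
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax per row

print(scores.shape, weights.shape)        # (8, 8): quadruple the context and these become (32, 32)
```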
For this reason, at the time of writing (6/2025), some models still limit the context window to, for example, 128 000 tokens, and in previous generations it was common to talk about limits of 4, 8 or 16 thousand tokens. In earlier models, the context limit was also more commonly encountered in, for example, ChatGPT conversations, where the model seemed to forget parts of the conversation. These constraints have eased somewhat, but when working with large code bases or using reasoning chains, for example, the context limitations of current models still become apparent.
For comparison, when converted to pages of text in a book, the original GPT-3.5 turbo model could only internalise about 1.5 pages of text at a time. With GPT-4o, the limit is on the order of 50 pages.
At the end of the article, we'll discuss how some current models, such as GPT-4.1 and Gemini 2.5, offer support for 1-2 million tokens (equivalent to 400-800 text pages), but before that we'll look at what to do to condense the context anyway - calls with large contexts can quickly become expensive even when the model allows them in the first place.
What helps - optimising the context
Fortunately, the language model does not need to be fed all the data at the same time. One technique used with language models is Retrieval Augmented Generation (RAG), which was originally created to give language models access to data other than what was used in the training phase. This makes it possible to use up-to-date or non-public data with the language model. However, RAG also helps with context size: when the actual data retrieval is done outside the language model, for example using vector databases, only the most relevant data is pre-collected into the language model's context. In practice, therefore, there is no need (and no way) to try to feed the language model, for example, the company's entire Slack history or Sharepoint documents; only the messages or documents that are relevant are added to the context.
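A minimal RAG-style sketch with toy data; the embed() function here is only a hypothetical stand-in for a real embedding model, and a real solution would use a vector database instead of in-memory arrays:

```python
import numpy as np

# Retrieve only the most relevant chunks and place them in the prompt context.
def embed(text: str) -> np.ndarray:
    # Toy stand-in: maps text to a pseudo-random vector instead of a real embedding.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

documents = [
    "Invoicing instructions for subcontractors.",
    "Office coffee machine maintenance log.",
    "Data processing agreement template, updated 2024.",
]
doc_vectors = np.array([embed(d) for d in documents])

def retrieve(query: str, top_k: int = 2) -> list[str]:
    q = embed(query)
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [documents[i] for i in np.argsort(sims)[::-1][:top_k]]

question = "What should our data processing agreement contain?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)   # only the retrieved chunks end up in the language model's context
```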
Other techniques related to context optimisation include:
pre-compaction: the source material is compressed once or even recursively if necessary
contextual chunking: the source material is chunked into semantically meaningful sections rather than mechanically into chunks of a fixed size (a small sketch follows this list)
external memory and/or external tools: give the model access to an external memory or allow it to use external tools when needed
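As an example of contextual chunking, here is a rough sketch that splits on paragraph boundaries and packs paragraphs under a token budget; the token count is approximated by counting words, which a real implementation would replace with the model's tokenizer:

```python
# Split on paragraph boundaries and pack paragraphs greedily under a budget
# instead of cutting the text mechanically in the middle of a sentence.
def chunk_by_paragraphs(text: str, max_tokens: int = 200) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for p in paragraphs:
        p_len = len(p.split())                   # crude stand-in for a real tokenizer
        if current and current_len + p_len > max_tokens:
            chunks.append("\n\n".join(current))  # close the chunk at a paragraph boundary
            current, current_len = [], 0
        current.append(p)
        current_len += p_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = "First paragraph about invoicing.\n\nSecond paragraph about contracts.\n\nThird one."
print(chunk_by_paragraphs(sample, max_tokens=8))
```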
What does the future look like?
As noted earlier, the number of computations required by the attention mechanism grows quadratically with the length of the context window, because the effect of every context token on the interpretation of every other token must be taken into account. In other words, it is still possible for a language model to take every single word into account when interpreting, say, a 25-page document, but for a 100-page document this is not feasible today, because the required computation is 16 times greater!
A good question that follows from this is: do we really need to take every single word into account when interpreting every other word? And even if we do, do we have to redo all the calculations every time, or could they be cached?
Indeed, current models combine different techniques whereby words on the same page (to stick with the book example) have more relevance to each other than words dozens of pages apart. There are also caching techniques, where it is essential to control memory usage so that it does not explode in turn.
With caching, the memory requirements can easily get out of hand, because naively the query and key vectors, as well as the value vectors used to form the dot-product-weighted result, would have to be stored for every token.
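A minimal sketch with random toy data of the caching idea: the keys and values of earlier tokens are stored so that each new token only needs its own vectors computed, at the cost of a cache that grows with the context (the projections here are invented placeholders, not trained weights):

```python
import numpy as np

# A key-value cache during generation.
rng = np.random.default_rng(0)
d = 64                               # dimension of the toy key/value vectors

k_cache, v_cache = [], []            # grow by one entry per generated token

def attend_with_cache(new_token_vec):
    # Placeholder projections; real models use trained weight matrices per layer and head.
    k_cache.append(new_token_vec)            # cache this token's key...
    v_cache.append(new_token_vec * 0.5)      # ...and its value
    q = new_token_vec                        # query only for the newest token
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)              # dot products against all cached keys
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ V                       # attention output for the new token

for _ in range(5):
    out = attend_with_cache(rng.standard_normal(d))

print(out.shape, "- output for the latest token")
print(len(k_cache), "keys and", len(v_cache), "values cached; memory grows with the context")
```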
In view of these problems, different models have already tried different ways to achieve sufficiently good results even when only some token pairs interact. On the other hand, there has also been research into reducing recalculation so that the memory requirements of the cache do not get out of hand in turn. These methods include:
sparse attention, where scores are computed for only a subset of token pairs
local + global attention, where tokens close to each other have more impact on each other and only a subset of tokens affect the rest of the context (a small sketch of a local attention mask follows this list)
multi-query attention and grouped-query attention, where several query heads share the same key and value vectors, which makes caching the key and value vectors cheaper and keeps memory requirements more manageable
rotary positional embeddings (RoPE), where a token's position in the context is encoded by "rotating" pairs of vector components by an angle that depends on the position. This has proven particularly effective for encoding long sequences
multi-head latent attention (MLA), which caches lower-dimensional latent vectors than in standard multi-head attention (MHA). This approach was explored during the development phase of DeepSeek V2 and is used in DeepSeek V3 and elsewhere.
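As an illustration of the local attention mentioned above, here is a minimal sketch of a sliding-window attention mask, where each token only attends to its near neighbours, so the number of computed score pairs grows roughly linearly with the context instead of quadratically:

```python
import numpy as np

# A local (sliding-window) attention mask: each token may only attend to
# tokens within `window` positions of itself.
def sliding_window_mask(context_length: int, window: int) -> np.ndarray:
    positions = np.arange(context_length)
    return np.abs(positions[:, None] - positions[None, :]) <= window

mask = sliding_window_mask(context_length=10, window=2)
print(mask.astype(int))          # 1 = score is computed, 0 = pair is skipped
print("pairs computed:", int(mask.sum()), "of", 10 * 10)
```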
In addition, the above techniques are combined with other techniques. For example, Mixture of Experts (MoE) is a technique where instead of one large-scale network, the model is internally composed of several more specialised networks. In the Transformer architecture, it is often used to choose which feed-forward layers to run on the input and thus reduces the computational capacity required for inference, but does not in itself facilitate the computation required by the attention mechanism. It is, however, well suited as an adjunct to MLA, for example.
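A minimal sketch of the MoE routing idea with random toy weights: a small gating network picks the top two expert feed-forward networks per token, so only a fraction of the feed-forward capacity is evaluated:

```python
import numpy as np

# Mixture-of-Experts routing: only the chosen experts are run for each token.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
gate = rng.standard_normal((d_model, n_experts))    # gating network weights

def moe_forward(token_vec: np.ndarray) -> np.ndarray:
    scores = token_vec @ gate
    chosen = np.argsort(scores)[-top_k:]             # indices of the top-2 experts
    weights = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()
    # Only the chosen expert networks are evaluated; the rest are skipped entirely.
    return sum(w * (token_vec @ experts[i]) for w, i in zip(weights, chosen))

out = moe_forward(rng.standard_normal(d_model))
print(out.shape, "- computed with", top_k, "of", n_experts, "experts")
```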
Finally: practical tips for better results
So, how does all of this under-the-hood detail affect how AI solutions should be designed in practice? Your organisation has surely already conducted various AI experiments or is already using large language models, but below are our tips for situations where AI solutions are going to be scaled up and taken into production:
As noted at the beginning, many bounded, mathematical problems already have good solutions, and for many of them LLMs are not practically applicable at all. For example, do not try to solve route or shift planning with a large language model, as it will produce incorrect results. Instead, select or, if necessary, train a specialised model appropriate to the problem.
The solution is usually more predictable, more robust, more observable and cheaper if the problem to be solved is broken down into smaller pieces. This also allows you to use other types of models where they are better suited, and it keeps the context size of individual LLM calls under control.
Even large language models work better the more relevant the information placed in their limited context is. Both the quality of the data and how well the models can access it directly affect the quality of the final output. Don't reinvent the wheel: make use of tools such as the Model Context Protocol (MCP) and the pre-built tools built on top of it.
Reasoning models are great for ad-hoc coding tasks, but don't overuse them when you're producing a routine, formulaic solution. It is worth starting by exploring the problem with reasoning models, but then breaking the problem itself into smaller chunks once workable solutions start to emerge. Reasoning also consumes context space of its own, and you may run into token limits or unexpected costs if you use reasoning where it isn't needed.
Both the size of the context and other features are evolving rapidly all the time, so the familiar family of models is not always the best option. For example, if you normally use OpenAI models, when you encounter limitations, take a look at what features the latest Gemini, Claude, Mistral or DeepSeek models offer. Gemini 1.5 Pro, for example, offered a million-token context window well before it became possible in OpenAI models with GPT-4.1.
These are our tips for tackling the under-the-hood constraints and taking your AI solutions from the ground up to scalable and production-ready! By approaching your situation and challenges systematically, we at Kipinä provide a strong vision and cutting-edge technical expertise to create next-generation AI solutions!
When you need sparring on how to raise the bar for AI solutions or a vision for holistic digital development - contact Jari!
Jari Huilla, CTO & partner Kipinä
Jari Huilla is Kipinä's new CTO with an exceptionally long and diverse background in technology, having found his first job at Nokia Research Center at the age of 15. Over the years, he has worked as a developer, a leader and a builder of growth companies. At Kipinä, Jari brings together deep technical knowledge and business-oriented thinking. In particular, he is interested in how to make AI solutions not only technically functional but also truly fit for purpose - and how to understand their limitations rather than ignore them.