Revealing Secrets Of Large Language Models And Generative AI Via Old-Fashioned Markov Chain Mathematics

Revealing Secrets Of Large Language Models And Generative AI Via Old-Fashioned Markov Chain Mathematics

Using the Markov chain statistical analysis technique might be a vital tool toward unlocking what is … [+] happening inside generative AI and large language models.

getty

In today’s column, I closely examine an innovative way of exploring the inner secrets of generative AI and large language models (LLMs) via using a rather old-fashioned mathematical modeling technique known as a Markov chain. You might have heard about or even learned how to make use of Markov chains in a college-level course on statistics and probabilities. No worries if you’ve never dealt with Markov chains since I will be bringing you fully up-to-speed on the absorbing topic.

The crux is that recent work by AI researchers suggests that we can gain important insights about generative AI and LLMs through the application of Markov chains. That’s quite exciting news.

Let’s talk about it.

This analysis of an innovative proposition is part of my ongoing Forbes.com column coverage on the latest in AI including identifying and explaining various impactful AI complexities (see the link here).

Modeling Of Current States Or Steps In A Process

First, I will do a quick introduction about Markov chains.

Consider a recent lived experience scenario of mine that involves a series of actions or steps. This will provide a vivid means of laying out what the mathematical or statistical modeling technique known as Markov chains is all about.

Let’s start with everyday life. I recently went to the DMV to get some paperwork filed regarding my car. I went into the DMV office and checked in at a main window. The clerk looked at the paperwork that I had preliminarily filled in. Based on the completeness status, the clerk would either send me to the processing window to get my paperwork filed or to a secondary clean-up window where someone would assist in fixing anything wrong on my document. My hope was to be sent straight to the processing window. Fingers dearly crossed.

By and large, this is pretty typical of what happens when you visit any bureaucracy of one kind or another. It is also typical of any kind of ordinary process that encompasses a series of steps or states of action.

The main window at the DMV office is where people check in. The clerk dispatches people to one of two follow-on windows, either the processing window or the clean-up window. At the initial start of my DMV journey, I began at the main window.

You could say that at the start of my DMV journey, I was in a condition or state of waiting at the main window. From that initial state, my next state or condition would be to move to or transition over to the processing window or over to the clean-up window. To recap, we have some notable elements in this scenario, consisting of specific states, such a current state and a next potential state, along with transitions to get from one state to another.

Which of those two windows might I be sent to?

Well, it all depends on the outcome of my effort at the main window.

Mull that over. What would you say are the odds that I will go to the processing window? What are the odds that I will be sent to the clean-up window? Well, all that we can say for sure is that there are only two places I might be sent to, either the processing window or the clean-up window. Since those are the only two options available, you could say that my chances of going to either one is evenly split, namely there is a 50% chance of the processing window and a 50% chance of the clean-up window.

Aha, we are getting somewhere via my endearing tale. Keep going.

Andrey Markov Lays Out A Handy Modeling Technique

A famous Russian mathematician named Andrey Markov delivered a groundbreaking lecture in 1913 that said there are lots of processes that involve a series of steps or states, and that you tend to proceed from one state to another state based on a statistical or probabilistic chance of doing so. The steps or states are essentially arranged in a linked chain or series. By laying out the various states, along with their accompanying transitions and probabilities, you can model how a process works.

This was a clever idea at the time and his modeling technique became known as a Markov chain.

Boom, drop the mic.

Based on such a model, you can garner all manner of helpful insights. Are all the states or steps really required? Do some transitions of one state to the next state occur a lot more frequently than others? At the DMV, for example, we might use that analysis to figure out whether we have enough workers and where those workers need to be assigned.

In the case of my DMV visit, there weren’t many steps or states that I would be going through. But I’m sure you’ve been to places where they have had dozens of various steps and each step would lead to some other step, depending upon what happened at the prior step. We can use the Markov chain on seemingly any sized process that we want to model. Perhaps the number of states is small such as my DMV visit, or immense in that the count of states might be in the dozens, hundreds, thousands or more.

You now know the essence of Markov chains, congrats.

Markov Tells Us In His Own Words What He Came Up With

I have something to share with you that I think you will find very intriguing.

I found online a written copy or translation of Andrey Markov’s lecture from 1913, and I thought you might like to go back in time to see what his groundbreaking speech contained. His example wasn’t about the DMV (thank goodness), but instead he took a well-known book of poetic verses and wanted to figure out the pattern of how letters led to other letters. Since this work was done by hand, he opted to use a sampling of text that amounted to 20,000 letters in size.

Without further ado, here are excerpts from “An Example of Statistical Investigation of the Text Eugene Onegin Concerning the Connection of Samples in Chains” by Andrey Markov, Lecture at the Royal Academy of Sciences, January 23, 1913, translation reprinted in Science In Context, Cambridge University Press, 2006.

  • “This study investigates a text excerpt containing 20,000 Russian letters of the alphabet, excluding two special letters from Pushkin’s novel Eugene Onegin — the entire first chapter and sixteen stanzas of the second.”
  • “This sequence provides us with 20,000 connected trials, which are either a vowel or a consonant.”
  • “Accordingly, we assume the existence of an unknown constant probability p that the observed letter is a vowel. We determine the approximate value of p by observation, by counting all the vowels and consonants.”
  • “Apart from p, we shall find — also through observation — the approximate values of two numbers p1 and p0, and four numbers p1.1, p1.0, p0.1, and p0.0. They represent the following probabilities: p1 a vowel follows another vowel; p0 a vowel follows a consonant; p1.1 a vowel follows two vowels; p1.0 a vowel follows a consonant that is preceded by a vowel; p0.1 a vowel follows a vowel that is preceded by a consonant; and, finally, p0.0 a vowel follows two consonants.”

I trust that you can see how his example parallels my story of the DMV.

Markov was focused in his speech on states of letters such that if you were at a given letter, you could potentially predict what the next letter would be. There would be a probability associated with which state you would transition to. He was able to calculate these probabilities by doing relatively straightforward arithmetic calculations based on inspecting the words and letters, along with their sequences.

You certainly have to give him modern-day props for doing the hard work to both elucidate and simultaneously showcase his modeling technique. His name lives on as evidence of his cleverness and arduous deliberations.

Generative AI And Large Language Models Are Processes

I’ve got a twist for you.

Now that you are familiar with Markov chains, you might be wondering what the heck this has to do with generative AI and large language models. How can something from 1913 possibly have any relevance to the latest and greatest advances in current times AI?

I’m super glad that you asked, hang in there.

Let’s start with what generative AI and LLMs consist of.

You likely have used generative AI and large language models such as ChatGPT and GPT-4o by OpenAI, or Anthropic Claude, Google Gemini, Meta Llama, or other similar AI apps. Usage is through the roof right now. There are about 250 million weekly active users of ChatGPT alone, and many hundreds of millions more users of AI when you include the other major generative AI wares.

Generative AI and LLMs are based on computational pattern-matching. AI makers scan content on the Internet that contains essays, narratives, poems, and the like, and use advanced mathematics and computational approaches to pattern on how humans write. The result is that the computer can mimic human writing. You carry on a conversation with such AI, and it remarkably mimics human natural language usage such as interacting in English. For my in-depth explanation of what’s taking place under the hood, see the link here.

Most people assume that AI researchers and AI developers know precisely what is going on inside generative AI and LLMs. The thing is, that’s not especially the case. The internal computations are so massive that though you can trace numbers from here to there, it is very hard to explain why the AI works so well. I’ve covered the extensive efforts underway to crack open the underlying logical semblance of meaning that underlies the convoluted sets of values and calculations, see my discussion at the link here and the link here.

A kind of Holy Grail entails someone figuring out in a meaningful fashion the hidden mysterious secret of what makes generative AI and LLMs seemingly appear to be fluent in natural language.

Okay, that brings us to the twist.

Some suggest that Markov chains could be a useful key toward unlocking the deepest secrets within generative AI and large language models. How so? Because you could assert that within those computational machinations are processes. The internal processes consist of states, and the states transition from state to state, based on probabilities. It might be extremely helpful to try and model the likes of generative AI and LLMs via the old-fashioned but still relevant and somewhat powerful Markov chains.

Seems like a match made in heaven.

Toying With Marko Chains And GPTs

There has been a lot of talk for quite a while among AI insiders about using Markov chains to investigate generative AI and LLMs. You might say it is an obvious consideration. Most AI experts are likely already familiar with Markov chains because of studying computer science and statistics while getting their college degrees. It is an easy-peasy notion to contemplate leveraging Markov chains. No huge mental leap required.

Last year, the well-known AI scientist Andrej Karpathy tweeted on April 9, 2023, about his somewhat idle or curious interest in exploring generative AI such as GPTs via the use of Markov chains and gave this example (excerpts):

  • “This is a baby GPT with two tokens 0/1 and context length of 3, viewing it as a finite state Markov chain. It was trained on the sequence “111101111011110” for 50 iterations. The parameters and the architecture of the Transformer modifies the probabilities on the arrows.”
  • “E.g. we can see that: (a) state 101 deterministically transitions to 011 in the training data, so the probability of that transition becomes higher (79%). Not near 100% because we only did 50 steps of optimization. (b) state 111 goes to 111 and 110 with 50% probability each, which the model almost learns (45%, 55%). (c) states like 000 are never encountered during training, but have relatively sharp transition probabilities, e.g. 73% of going to 001. This is a consequence of inductive biases in the Transformer. One might imagine wanting this to be 50%, except in a real deployment almost every input sequence is unique, not present in the training data verbatim.”
  • “Not really sure where I was going with this :D, I think it’s interesting to train/study tiny GPTs because it becomes tractable to visualize and get an intuitive sense of the entire dynamical system.”

The gist of his example was that he said to envision you have a series of 0s and 1s that are arrayed in a sequence, and you decide to divide them up into groups of three digits. You might have a group of three digits such as 101 that has next in sequence the three digits of 011. How often does that happen? Well, you could readily see how many times that occurs in the total set and calculate the chance or probability of it occurring. Those are said-to-be states, their transitions, and calculated probabilities of moving from one state to another.

Just like Markov’s analysis of letters in a book of poetry.

Turns out, that’s a lot like what happens inside generative AI and LLMs.

When you enter words into your favored generative AI app, the words are first converted into numbers, referred to as tokens. The AI is going to mathematically and computationally try to figure out what numbers should be generated in response to your entered prompt. In a sense, it is all numbers. The tokens are the words you entered (converted into numbers), and after the AI identifies the response, which is initially in numbers, those numbers are converted back into words. For more about the details of tokens and tokenization, see my coverage at the link here.

This overall process of word completion is similar to the autocomplete function in a word processing package. You enter the words “The big dog began to” and the autocomplete might mathematically and computationally predict that the next word is “bark”, such that your sentence is going to be completed as “The big dog began to bark.”

Research Digs Deeper Into The Markov Chain Unlocking

Some AI researchers are excited about applying Markov chains to generative AI and LLMs. Others are skeptical and not convinced that the effort is worthwhile. Can Markov chains actually tell us something that we don’t already know? Will Markov chains scale to the massive size of what the most advanced LLMs and generative AI consist of? Is the use of Markov chains overhyped? Or is it undervalued?

Lots of pressing questions abound.

I was spurred to bring up this topic due to a recently posted study that caught my attention. I thought you might like to know the latest thinking on the heated topic. You can decide whether this is a valuable pursuit or a spinning of the wheels. Judge away.

In a recent study entitled “Large Language Models As Markov Chains” by Oussama Zekri, Ambroise Odonnat, Abdelhakim Benechehab, Linus Bleistein, Nicolas Boulle, Ievgen Redko, arXiv, October 3, 2024, these notable points were made (excerpts):

  • “Large language models (LLMs) have proven to be remarkably efficient, both across a wide range of natural language processing tasks and well beyond them.
  • “Although successful in practice, the origins of the impressive performance of LLMs remain elusive, as there is no widely accepted agreement in the scientific community on how they achieve remarkable reasoning capabilities that go far beyond their training data.”
  • “This work takes a step towards bridging the knowledge gap mentioned above by providing an explicit characterization of the LLM’s inference capabilities. For this, we adopt an intuitive, yet overlooked, approach that interprets LLMs as Markov chains operating on a finite state space of sequences and tokens.”
  • “A key insight is that despite the seeming infinity of LLMs generating capacity, they have a limited vocabulary and context window making all their possible input and output sequences enumerable. We show that despite the prohibitively large size of the latter set, it exhibits a structure that makes it amenable to theoretical analysis.”
  • “We derive several surprising findings related to the existence of a stationary distribution of Markov chains that capture the inference power of LLMs, their speed of convergence to it, and the influence of the temperature on the latter.”
  • “Finally, we illustrate our theoretical guarantees with experiments on several recent LLMs to highlight how they capture the behavior observed in practice.”

Their study does a yeoman’s job of providing various mathematical proofs, thus covering the theory side of the matter, plus they performed experiments using several contemporary LLMs. I always welcome research that tries to tackle both the theoretical side of a topic and the proverbial rubber meets the road practical aspects too.

State Of The Art On States And Transitions

There is a lot of possible unpacking about the insights gleaned from that research study, along with other studies that are also pursuing a similar line of inquiry about the application of Markov chains in this realm. Lamentedly, I don’t have the space here to cover all the fascinating results.

If there is reader interest, I’ll gladly do a follow-up piece to cover the nitty-gritty.

Allow me at least a brief moment to make some overall remarks. I do so with the realization that there are trolls that will whine about any conclusions or interpretations of these weighty matters. Things are still up in the air. I make that abundantly clear. Please consider these comments as tentative and merely thoughts as useful for consideration, thanks.

Here are three thought-provoking takeaways:

  • (1) Promising Usage. Under specific conditions or narrowing factors such as the vocabulary involved and other settable parameters, LLM behavior can apparently be approximated via the behavior of a finite Markov chain, subject to limits and exceptions.
  • (2) Scaling Laws. Interestingly, it appears that as the context window and vocabulary increase for LLMs, they seem to follow scaling laws similar to Markov chains, thus performance apparently scales in a seemingly revealing and anticipated way (maybe).
  • (3) Adaptability. It seems that LLMs are able to generalize across Markovian dependencies, which suggests that LLMs are indeed more adaptable and efficient at these efforts than traditional Markov models, which has been assumed as an intuitive hunch, but it is useful to see this demonstrated in mathematical and empirical means. Note this also suggests that non-traditional Markov models are worthy of additional comparison to LLMs.

On the adaptability topic, I recently examined the importance of generative AI and LLMs of being adaptable if we are going to make continued advancement and progress, see the link here.

Generative AI And LLMs Are Still On Top

One nut to crack is that Markov chains are customarily constrained to consider only a designated current state and a next state, such that prior states are not at play when the next state decision is made.

Here’s what that means, we’ll use my DMV story.

When I was at the DMV, I was sent to the processing window rather than the clean-up window, hurrah. After filing my document at the processing window, they asked me if I wanted to go to another window that would print out my new registration. If I didn’t want to wait, they could mail the registration to me. In that sense, I had two possibilities, go to the printout window or have the registration mailed to me.

The decision of which was my next state had nothing to do with my prior state, whereby I had been initially at the main window. The clerk at the processing window didn’t know and didn’t care how I got to the processing window. There was nothing about my prior states that the clerk cared about. In that same sense, Marko chains traditionally only are concerned with whatever the present state is. From that present state, your next state options are calculated and considered.

Shift to thinking about generative AI and LLMs.

One of the most notable breakthroughs that really makes generative AI so astounding is that you can have long passages of text, and the AI will consider that lengthy material when generating a response. It isn’t just the latest word that leads to the next word. The GPT or transformer mechanism such as self-attention explores lengthy elements of input sequences and is not rigidly limited to state-to-state transitions that are typical of Markov chains.

A nagging qualm is still that Markov chains though perhaps able to mimic some generative AI facets, won’t cut the mustard when it comes to the full depth, flexibility, and shall we say sophistication of LLMs that generate contextually relevant language over lengthy spans of text.

No worries. Consider the illustrative comments of Albert Einstein about the noble pursuit of science: “The important thing is to never stop questioning.” Yes, we need to keep questioning and answering questions on this meaty topic.

Stay tuned as I’ll be keeping my eye on Markov chains and let you know what the next state of affairs is. There is a 100% chance of my doing so.

Leave a Reply

Your email address will not be published. Required fields are marked *