
Understanding Tokens and Context in Language Models


What is a Model and Why Are There So Many?

That is a reasonable, almost expected first take if you are exploring any of the playground environments available to us today. In this post, we’re going to start with the basics. What is a language model? 

In simple terms, it’s software that understands and generates human language. So many models exist because each is designed for a different purpose and context, much like the different versions and editions of software.

The tl;dr, and the goal of today, is to show why the latest & greatest is not an absolute.

Organization Product Path to Market Hierarchy

OpenAI & Microsoft have a strange relationship: a tangled web of dependencies that leaves them frenemies at best. Yet despite the synergies the two embrace, there are striking parallels in how each enables a vast ecosystem of solutions.

Parallelism Amongst Partners

OpenAI & Microsoft are both organizations that actively develop & maintain a proprietary technology that has revolutionized much of the way the world operates.

Many of the world’s computers are running on Microsoft’s proprietary operating system family, Windows. 

Windows is also how we interact with their flagship product, Microsoft Office.

Conversely, ChatGPT is the manifestation of OpenAI’s proprietary technology.

Just like Windows has had many versions—Windows 10, Windows 11, and so on—OpenAI has developed various iterations of their language models, like GPT-3.5 and GPT-4, each of which powers the user interface we know as ChatGPT.

Visualization by Travis Vasceannie

Let’s intentionally ignore the booming (and bubbling) ecosystem of aftermarket applications and tools based upon OpenAI & Microsoft. 

To pivot back to OpenAI’s playground, you’ll find many more models than those mentioned above, like Ada, Davinci, and Whisper. These models cater to different needs – similar to the choice between Windows Home, Pro, or Server editions.

Now, imagine several alternatives to OpenAI entering the marketplace, each with its own offerings bearing unique strengths & weaknesses.

Overwhelmed, yet?

Good, because I certainly was. Instead of just pointing out which model is the best, we’re going to spell out the individual use cases and the methodology for evaluating which is the right one.

Diving Deeper: The OpenAI Ecosystem

GPT: The Flagship Family
GPT-4o mini

Description: A smaller, more efficient version of GPT-4o, designed to handle less complex tasks while maintaining high performance.

Purpose: Ideal for applications requiring a balance between computational efficiency and task complexity, such as lightweight conversational agents and quick content generation.

GPT-4o

Description: The most advanced model, capable of handling complex, multi-step tasks. It is multimodal, accepting both text and image inputs while outputting text. It is faster and cheaper than previous models.

Purpose: Ideal for high-intelligence tasks requiring complex reasoning and understanding across multiple languages and vision tasks.

GPT-4 Turbo

Description: A high-performance multimodal model optimized for chat and traditional text completion tasks.

Purpose: Suitable for applications needing high accuracy and responsiveness in text and image processing.

GPT-4

Description: An improvement over GPT-3.5, capable of understanding and generating natural language and code.

Purpose: Used for a wide range of natural language processing tasks, including conversation, content creation, and coding.

GPT-3.5 Turbo

Description: A fast and cost-effective model for simpler tasks.

Purpose: Ideal for applications requiring quick responses and lower computational costs.
The Specialist Models
DALL-E

A model that generates and edits images based on natural language prompts. Used for creative tasks such as generating artwork, designing, and visual content creation.

Whisper

A model that converts audio into text.  Useful for transcription, translation, and audio analysis tasks.

TTS (Text-to-Speech)

Models that convert text into natural-sounding spoken audio. Used for applications needing voice synthesis, such as virtual assistants and accessibility tools.

Embeddings

Models, accessed via API, that convert text into numerical vectors to facilitate tasks like pattern matching, clustering, and search optimization.
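As a toy illustration of why vectors help with search: you can rank documents by cosine similarity to a query vector. The 3-dimensional vectors and document names below are made up for the example; real embedding APIs return vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings: semantically close texts get nearby vectors.
docs = {
    "cat":     [0.90, 0.10, 0.00],
    "kitten":  [0.85, 0.20, 0.05],
    "invoice": [0.00, 0.10, 0.95],
}
query = docs["cat"]
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)  # "kitten" ranks above "invoice" for a "cat" query
```

The same ranking idea scales up to semantic search: embed every document once, embed the query at request time, and sort by similarity.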

Moderation

A fine-tuned model that detects sensitive or unsafe text, acting as a content moderation filter to ensure safety and compliance on online platforms.

Tokens

Each of the models above serves a unique purpose. Among OpenAI’s base (not instruction-tuned) models, Davinci is often used for creative writing and complex problem-solving, while Ada is preferred for tasks that require fast responses but don’t need heavy computational power. And so we segue into tokens & context.

Tokens are pieces of words that the model uses to understand and generate text. For instance, the word “fantastic” might be broken down into “fan”, “tas”, and “tic” as tokens.

"fantastic" → ["fan", "tas", "tic"]
"unhappy" → ["un", "happy"]

Understanding tokens is crucial because it directly impacts the model’s ability to process and generate language accurately. This is where we balance our use case and its potential token consumption against the context window.
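The splitting above can be sketched as a greedy longest-match lookup against a vocabulary. The vocabulary and helper below are illustrative inventions, not OpenAI’s actual tokenizer (which uses byte-pair encoding over a learned vocabulary):

```python
# Illustrative subword vocabulary; real models learn theirs from data.
VOCAB = {"fan", "tas", "tic", "un", "happy"}

def tokenize(word, vocab=VOCAB):
    """Split a word into the longest vocabulary pieces, left to right.
    Unknown spans fall back to single characters."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible piece starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # no match: emit a single character
            i += 1
    return tokens

print(tokenize("fantastic"))  # ['fan', 'tas', 'tic']
print(tokenize("unhappy"))    # ['un', 'happy']
```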

Context

Context Window

The context window is the amount of text the model can “see” at once, or its short-term memory. For example, GPT-3 has a context window of 2048 tokens, while GPT-4 can handle much more. This means GPT-4 can consider more context, leading to more coherent and relevant responses, especially in longer conversations.

  • GPT-3: 2048 tokens
  • GPT-4: Up to 32,768 tokens (depending on the version)
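One practical consequence of a fixed window: a chat application has to trim old history so the conversation still fits. A minimal sketch, where a whitespace word count stands in for real token counting:

```python
def trim_to_context(messages, budget, count_tokens=lambda m: len(m.split())):
    """Drop the oldest messages until the total fits within `budget` tokens.
    `count_tokens` is a crude stand-in; a real app would use the
    model's own tokenizer for accurate counts."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > budget:
        kept.pop(0)  # discard the oldest message first
    return kept

history = ["opening small talk " * 100, "recent question"]
print(trim_to_context(history, budget=50))  # only the recent message survives
```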
Max Output Tokens

Max output tokens are the maximum number of tokens the model can generate in a single response. This is crucial for tasks that require long-form content generation. For instance, if you’re generating a report or an article, a higher max output token limit allows the model to produce more comprehensive content in one go.

  • GPT-3.5 Turbo: Max output of 4096 tokens
  • GPT-4: Max output of 8192 tokens
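In practice, the cap is set per request. Here is a sketch of a chat-completion request body using the `max_tokens` parameter name from OpenAI’s API; the `build_request` helper is our own invention, and no network call is made:

```python
def build_request(prompt, model="gpt-3.5-turbo", max_tokens=4096):
    """Return a request body that caps the reply at `max_tokens`.
    The model may stop sooner, but never exceeds the cap."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # hard ceiling on generated tokens
    }

req = build_request("Draft a quarterly report.", model="gpt-4", max_tokens=8192)
print(req["max_tokens"])  # 8192
```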
Token Costs

To give you an idea of token costs, let’s look at a simple exchange and a complex one. A brief conversation might consume just a few tokens, costing a fraction of a cent.

However, generating a detailed report could consume thousands of tokens, increasing the cost significantly (more so on GPT-4o than on GPT-4o mini), depending on the complexity of the request. See the latest rates from OpenAI:

  • GPT-3.5 Turbo: $0.002 per 1K tokens (input and output)
  • GPT-4 8K context: $0.03 per 1K tokens (input), $0.06 per 1K tokens (output)
  • GPT-4 32K context: $0.06 per 1K tokens (input), $0.12 per 1K tokens (output)
Prompt complexity drives consumption too:

  • Clear, non-ambiguous prompts are cheap, such as:
    • What’s the weather
    • Fact check
    • Math or logical operations
  • Complex or unclear prompts are expensive, such as:
    • How will AI affect industry?
    • Analyze the familiarity of these texts
    • Draft me a quarterly report

Understanding the cost variance between different models is essential. For instance, GPT-4o (the “o” is for omni) might offer a balance between performance and cost, while GPT-4o mini would be more cost-effective but potentially less powerful for complex tasks.
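Those rates make back-of-the-envelope estimates easy. A small calculator using the per-1K-token prices quoted above (the model keys are our own labels, not official API identifiers):

```python
# (input $/1K tokens, output $/1K tokens), per the rates listed above.
RATES = {
    "gpt-3.5-turbo": (0.002, 0.002),
    "gpt-4-8k":      (0.03, 0.06),
    "gpt-4-32k":     (0.06, 0.12),
}

def estimate_cost(model, input_tokens, output_tokens):
    """Dollar cost of one exchange at the listed per-1K-token rates."""
    rate_in, rate_out = RATES[model]
    return (input_tokens / 1000) * rate_in + (output_tokens / 1000) * rate_out

# A brief chat vs. a detailed report, both on GPT-4 (8K context):
print(estimate_cost("gpt-4-8k", 100, 200))    # fractions of a cent territory
print(estimate_cost("gpt-4-8k", 2000, 6000))  # dozens of cents
```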

Choosing the Right Model

Different tasks require different models. We know that much, along with some expected costs – so let’s break down the specifics of the decision process. Here’s a quick guide to help you pick the right one.

Multimodal

Models handling complex tasks that involve multiple inputs (text, audio & images) or multistep processes.

Long Context

For complex tasks involving multi-step processes; these primarily benefit from a long context size.

Lightweight

For simpler tasks, or when computational efficiency is crucial, lightweight models are your best bet.

Specialist

Models like DALL-E, TTS (Text-to-Speech), and Whisper specialize in media tasks, such as generating images from text, converting text to speech, and transcribing audio to text, respectively.

Turbo

Turbo models, like GPT-4 Turbo, offer enhanced performance with quicker response times and lower computational costs compared to their standard counterparts. A fair comparison would be the premium tiers or mid-generation rereleases of game consoles (e.g., PlayStation 5 & PlayStation 5 Pro).

Base

Base models, free of instruction tuning, are designed for specific, often niche, tasks.

Which is right for me?
Marketing Professional

 For demand generation, a model like GPT-4 can create compelling, contextually aware content.

Implementation Consultant

To visualize data, a model like GPT-3.5 Turbo can help generate insights and create visual aids.

Product Owner

For handling confidential IP, using a robust and secure model like Davinci ensures data integrity and privacy.

Product / Platform Arena

We’ve spent a lot of time focusing on OpenAI’s product offering. But which platform you go to is influenced by many factors, the largest of which is its training data. Below are some benchmarks used to assess the performance of language models:

  • GLUE: General Language Understanding Evaluation.
    • GLUE is a benchmark designed to evaluate the performance of models across a diverse set of natural language understanding (NLU) tasks. High scores here translate to an increased likelihood that your spirit animal is a Swiss army knife.
  • SuperGLUE: Super General Language Understanding Evaluation.
    • An extension of GLUE that presents more challenging tasks to evaluate the performance of NLU models. Similar to New Game+ (without the initial playthrough, either).
  • SQuAD: Stanford Question Answering Dataset
    • Assesses reading comprehension skills via questions posed on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding passage. High scores here are correlated to a lower likelihood that a user will experience a hallucination.
  • BLEU: Bilingual Evaluation Understudy
    • An algorithm for evaluating the quality of text that has been machine-translated from one language to another. While the connotation suggests spoken language, this broadly covers programming languages as well.
  • ROUGE: Recall-Oriented Understudy for Gisting Evaluation.
    • A set of metrics used for evaluating automatic summarization and machine translation software. One of the better measures of content generation and information synthesis.

 

| Company   | Model    | GLUE Score | Context Window | Max Output Tokens | Token Cost |
|-----------|----------|------------|----------------|-------------------|------------|
| OpenAI    | GPT-4    | 90.0       | 8192/32768     | 8192              | High       |
| OpenAI    | GPT-3.5  | 88.4       | 4096           | 4096              | Moderate   |
| Anthropic | Claude 2 | 90.2       | 100,000        | 100,000           | Moderate   |
| Google    | PaLM 2   | 90.4       | Varies         | Varies            | N/A        |
| Meta      | LLaMA 2  | 80.2       | 4096           | 4096              | Low        |
As for me and my current preference, I still find GPT-4o mini’s speed and cost incredible; my own use cases have dropped over 99% in token costs. But a subtle reality is that these scores won’t mean much to a consumer for long, nor do they stay accurate. This is especially the case as we inch toward a world where all viable solutions exist as homogeneous copies of each other.

To bring it home with an example: Anthropic’s Claude enjoyed 1st place for a few days, then, within 48 hours of the weekend beginning, surrendered the crown to Google’s Gemini, which is likely on track to return it to OpenAI upon the release of GPT-4o-long, an experimental model with double the output length.

So one last time: the question is never “which is the best”. You look at your circumstances and ask yourself:

  • Do I need a jack-of-all trades? ChatGPT
  • Am I unable to endure long periods of FOMO? ChatGPT
  • Could I trade web search for additional privacy? Claude
  • Is there value in seeing the creation in real time? Claude
  • Does my work have a bigger picture to it? Definitely not Copilot. But almost certainly Claude
  • Is my content generative and I can spot hallucinations? Gemini/PaLM
  • Are my needs based on research? Perplexity
  • Do I have access to an RTX 4090 or better? LLaMA
  • Will I need to work in an offline, local environment? DeepSeek
  • Could I get access to Groq to host my own language model? Hugging Face, Mistral, or LLaMA
