
Understanding Tokens and Context in Language Models


What is a Model and Why Are There So Many?

That is a reasonable, almost expected first take if you are exploring any of the playground environments available to us today. In this post, we’re going to start with the basics. What is a language model? 

In simple terms, it’s software that understands and generates human language. So many models exist because each is designed for a different purpose and context, much like the different versions and editions of software.

The tl;dr, and the goal of today, is to show why the latest & greatest is not an absolute.

Organization Product Path to Market Hierarchy

OpenAI & Microsoft have a strange relationship: a tangled web of dependencies that leaves them frenemies at best. Yet despite the synergies the two embrace, there are striking parallels in how each enables a vast ecosystem of solutions.

Parallelism Amongst Partners

OpenAI & Microsoft are both organizations that actively develop & maintain a proprietary technology that has revolutionized much of the way the world operates.

Many of the world’s computers are running on Microsoft’s proprietary operating system family, Windows. 

Windows is also how we interact with their flagship product, Microsoft Office.

Conversely, ChatGPT is the manifestation of OpenAI’s proprietary technology.

Just like Windows has had many versions—Windows 10, Windows 11, and so on—OpenAI has developed various iterations of their language models, like GPT-3.5 and GPT-4, each of which powers the user interface we know as ChatGPT.

Visualization by Travis Vasceannie

Let’s intentionally ignore the booming (and bubbling) ecosystem of aftermarket applications and tools based upon OpenAI & Microsoft. 

To pivot back to OpenAI’s playground, you’ll find many more models than those mentioned above, like Ada, Davinci, and Whisper. These models cater to different needs – similar to the choice between Windows Home, Pro, or Server editions.

Now, imagine several alternatives to OpenAI entering the marketplace, each with its own offerings bearing unique strengths & weaknesses.

Overwhelmed, yet?

Good, because I certainly was. Instead of just pointing out which model is the best, we’re going to spell out the individual use cases and the methodology for evaluating which is the right one.

Diving Deeper: The OpenAI Ecosystem

GPT: The Flagship Family
GPT-4o mini

Description: A smaller, more efficient version of GPT-4o, designed to handle less complex tasks while maintaining high performance.

Purpose: Ideal for applications requiring a balance between computational efficiency and task complexity, such as lightweight conversational agents and quick content generation.

GPT-4o

Description: The most advanced model, capable of handling complex, multi-step tasks. It is multimodal, accepting both text and image inputs while outputting text. It is faster and cheaper than previous models.

Purpose: Ideal for high-intelligence tasks requiring complex reasoning and understanding across multiple languages and vision tasks.

GPT-4 Turbo

Description: A high-performance multimodal model optimized for chat and traditional text completion tasks.

Purpose: Suitable for applications needing high accuracy and responsiveness in text and image processing.

GPT-4

Description: An improvement over GPT-3.5, capable of understanding and generating natural language and code.

Purpose: Used for a wide range of natural language processing tasks, including conversation, content creation, and coding.

GPT-3.5 Turbo

Description: A fast and cost-effective model for simpler tasks.

Purpose: Ideal for applications requiring quick responses and lower computational costs.
The Specialist Models
DALL-E

A model that generates and edits images based on natural language prompts. Used for creative tasks such as generating artwork, designing, and visual content creation.

Whisper

A model that converts audio into text.  Useful for transcription, translation, and audio analysis tasks.

TTS (Text-to-Speech)

Models that convert text into natural-sounding spoken audio. Used for applications needing voice synthesis, such as virtual assistants and accessibility tools.

Embeddings

Models, accessed via API, that convert text into numerical vectors to facilitate tasks like pattern matching, clustering, and search optimization.
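As a toy illustration of why vectors help with search: you can rank documents by cosine similarity to a query vector. The 3-dimensional vectors and document names below are made up for the example; real embedding APIs return vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings: semantically close texts get nearby vectors.
docs = {
    "cat":     [0.90, 0.10, 0.00],
    "kitten":  [0.85, 0.20, 0.05],
    "invoice": [0.00, 0.10, 0.95],
}
query = docs["cat"]
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)  # "kitten" ranks above "invoice" for a "cat" query
```

The same ranking idea scales up to semantic search: embed every document once, embed the query at request time, and sort by similarity.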

Moderation

A fine-tuned model that detects sensitive or unsafe text, acting as a content moderation filter to ensure safety and compliance on online platforms.

Tokens

Each of the models above serves a unique purpose. Among OpenAI’s base (not instruction-tuned) models, Davinci is often used for creative writing and complex problem-solving, while Ada is preferred for tasks that require fast responses but don’t need heavy computational power. And so we segue into tokens & context.

Tokens are pieces of words that the model uses to understand and generate text. For instance, the word “fantastic” might be broken down into “fan”, “tas”, and “tic” as tokens.

"fantastic" → ["fan", "tas", "tic"]
"unhappy" → ["un", "happy"]

Understanding tokens is crucial because it directly impacts the model’s ability to process and generate language accurately. This is where we balance our use case and its potential token consumption against the context window.
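The splitting above can be sketched as a greedy longest-match lookup against a vocabulary. The vocabulary and helper below are illustrative inventions, not OpenAI’s actual tokenizer (which uses byte-pair encoding over a learned vocabulary):

```python
# Illustrative subword vocabulary; real models learn theirs from data.
VOCAB = {"fan", "tas", "tic", "un", "happy"}

def tokenize(word, vocab=VOCAB):
    """Split a word into the longest vocabulary pieces, left to right.
    Unknown spans fall back to single characters."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible piece starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # no match: emit a single character
            i += 1
    return tokens

print(tokenize("fantastic"))  # ['fan', 'tas', 'tic']
print(tokenize("unhappy"))    # ['un', 'happy']
```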

Context

Context Window

The context window is the amount of text the model can “see” at once, or its short-term memory. For example, GPT-3 has a context window of 2048 tokens, while GPT-4 can handle much more. This means GPT-4 can consider more context, leading to more coherent and relevant responses, especially in longer conversations.

  • GPT-3: 2048 tokens
  • GPT-4: Up to 32,768 tokens (depending on the version)
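One practical consequence of a fixed window: a chat application has to trim old history so the conversation still fits. A minimal sketch, where a whitespace word count stands in for real token counting:

```python
def trim_to_context(messages, budget, count_tokens=lambda m: len(m.split())):
    """Drop the oldest messages until the total fits within `budget` tokens.
    `count_tokens` is a crude stand-in; a real app would use the
    model's own tokenizer for accurate counts."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > budget:
        kept.pop(0)  # discard the oldest message first
    return kept

history = ["opening small talk " * 100, "recent question"]
print(trim_to_context(history, budget=50))  # only the recent message survives
```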
Max Output Tokens

Max output tokens are the maximum number of tokens the model can generate in a single response. This is crucial for tasks that require long-form content generation. For instance, if you’re generating a report or an article, a higher max output token limit allows the model to produce more comprehensive content in one go.

  • GPT-3.5 Turbo: Max output of 4096 tokens
  • GPT-4: Max output of 8192 tokens
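In practice, the cap is set per request. Here is a sketch of a chat-completion request body using the `max_tokens` parameter name from OpenAI’s API; the `build_request` helper is our own invention, and no network call is made:

```python
def build_request(prompt, model="gpt-3.5-turbo", max_tokens=4096):
    """Return a request body that caps the reply at `max_tokens`.
    The model may stop sooner, but never exceeds the cap."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # hard ceiling on generated tokens
    }

req = build_request("Draft a quarterly report.", model="gpt-4", max_tokens=8192)
print(req["max_tokens"])  # 8192
```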
Token Costs

To give you an idea of token costs, let’s look at a simple exchange and a complex one. A brief conversation might consume just a few tokens, costing a fraction of a cent.

However, generating a detailed report could consume thousands of tokens, increasing the cost significantly (more so on GPT-4o than on GPT-4o mini), depending on the complexity of the request. See the latest rates from OpenAI:

  • GPT-3.5 Turbo: $0.002 per 1K tokens (input and output)
  • GPT-4 8K context: $0.03 per 1K tokens (input), $0.06 per 1K tokens (output)
  • GPT-4 32K context: $0.06 per 1K tokens (input), $0.12 per 1K tokens (output)
Prompt complexity drives consumption too:

  • Clear, non-ambiguous prompts are cheap, such as:
    • What’s the weather
    • Fact check
    • Math or logical operations
  • Complex or unclear prompts are expensive, such as:
    • How will AI affect industry?
    • Analyze the familiarity of these texts
    • Draft me a quarterly report

Understanding the cost variance between different models is essential. For instance, GPT-4o (the “o” is for omni) might offer a balance between performance and cost, while GPT-4o mini would be more cost-effective but potentially less powerful for complex tasks.
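Those rates make back-of-the-envelope estimates easy. A small calculator using the per-1K-token prices quoted above (the model keys are our own labels, not official API identifiers):

```python
# (input $/1K tokens, output $/1K tokens), per the rates listed above.
RATES = {
    "gpt-3.5-turbo": (0.002, 0.002),
    "gpt-4-8k":      (0.03, 0.06),
    "gpt-4-32k":     (0.06, 0.12),
}

def estimate_cost(model, input_tokens, output_tokens):
    """Dollar cost of one exchange at the listed per-1K-token rates."""
    rate_in, rate_out = RATES[model]
    return (input_tokens / 1000) * rate_in + (output_tokens / 1000) * rate_out

# A brief chat vs. a detailed report, both on GPT-4 (8K context):
print(estimate_cost("gpt-4-8k", 100, 200))    # fractions of a cent territory
print(estimate_cost("gpt-4-8k", 2000, 6000))  # dozens of cents
```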

Choosing the Right Model

Different tasks require different models. We know that much, along with some expected costs – so let’s break down the specifics of the decision process. Here’s a quick guide to help you pick the right one.

Multimodal

Models handling complex tasks that involve multiple inputs (text, audio & images) or multistep processes.

Long Context

For complex tasks involving multi-step processes; these primarily benefit from a long context size.

Lightweight

For simpler tasks, or when computational efficiency is crucial, lightweight models are your best bet.

Specialist

Models like DALL-E, TTS (Text-to-Speech), and Whisper specialize in media tasks, such as generating images from text, converting text to speech, and transcribing audio to text, respectively.

Turbo

Turbo models, like GPT-4 Turbo, offer enhanced performance with quicker response times and lower computational costs compared to their standard counterparts. A fair comparison would be the premium tiers or mid-generation rereleases of game consoles (e.g., PlayStation 5 & PlayStation 5 Pro).

Base

Base models, free of instruction tuning, are designed for specific, often niche, tasks.

Which is right for me?
Marketing Professional

 For demand generation, a model like GPT-4 can create compelling, contextually aware content.

Implementation Consultant

To visualize data, a model like GPT-3.5 Turbo can help generate insights and create visual aids.

Product Owner

For handling confidential IP, using a robust and secure model like Davinci ensures data integrity and privacy.

Product / Platform Arena

We’ve spent a lot of time focusing on OpenAI’s product offering. But which platform you go to is influenced by many factors, the largest of which is its training data. Below are some benchmarks used to assess the performance of language models:

  • GLUE: General Language Understanding Evaluation.
    • GLUE is a benchmark designed to evaluate the performance of models across a diverse set of natural language understanding (NLU) tasks. High scores here translate to an increased likelihood that your spirit animal is a Swiss army knife.
  • SuperGLUE: Super General Language Understanding Evaluation.
    • An extension of GLUE that presents more challenging tasks to evaluate the performance of NLU models. Similar to New Game+ (without the initial playthrough, either).
  • SQuAD: Stanford Question Answering Dataset
    • Assesses reading comprehension skills via questions posed on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding passage. High scores here are correlated to a lower likelihood that a user will experience a hallucination.
  • BLEU: Bilingual Evaluation Understudy
    • An algorithm for evaluating the quality of text that has been machine-translated from one language to another. While the connotation suggests spoken language, this broadly covers programming languages as well.
  • ROUGE: Recall-Oriented Understudy for Gisting Evaluation.
    • A set of metrics used for evaluating automatic summarization and machine translation software. One of the better measures of content generation and information synthesis.

 

| Company   | Model    | GLUE Score | Context Window | Max Output Tokens | Token Cost |
|-----------|----------|------------|----------------|-------------------|------------|
| OpenAI    | GPT-4    | 90.0       | 8192/32768     | 8192              | High       |
| OpenAI    | GPT-3.5  | 88.4       | 4096           | 4096              | Moderate   |
| Anthropic | Claude 2 | 90.2       | 100,000        | 100,000           | Moderate   |
| Google    | PaLM 2   | 90.4       | Varies         | Varies            | N/A        |
| Meta      | LLaMA 2  | 80.2       | 4096           | 4096              | Low        |
As for me and my current preference, I still find GPT-4o mini’s speed and cost incredible; my own use cases have dropped over 99% in token costs. But a subtle reality is that these scores won’t mean much to a consumer for long, nor do they stay accurate. This is especially the case as we inch toward a world where all viable solutions exist as homogeneous copies of each other.

To bring it home with an example: Anthropic’s Claude enjoyed 1st place for a few days, then, within 48 hours of the weekend beginning, surrendered the crown to Google’s Gemini, which is likely on track to return it to OpenAI upon the release of GPT-4o-long, an experimental model with double the output length.

So one last time: the question is never “which is the best”. You look at your circumstances and ask yourself:

  • Do I need a jack-of-all trades? ChatGPT
  • Am I unable to endure long periods of FOMO? ChatGPT
  • Could I trade web search for additional privacy? Claude
  • Is there value in seeing the creation in real time? Claude
  • Does my work have a bigger picture to it? Definitely not Copilot. But almost certainly Claude
  • Is my content generative and I can spot hallucinations? Gemini/PaLM
  • Are my needs based on research? Perplexity
  • Do I have access to an RTX 4090 or better? LLaMA
  • Will I need to work in an offline, local environment? DeepSeek
  • Could I get access to Groq to host my own language model? Hugging Face, Mistral, or LLaMA
