Autonomía digital y tecnológica

Code and ideas for a distributed internet

Linkoteca. LLM


A proposal to standardise on using an /llms.txt file to provide information to help LLMs use a website at inference time.

Large language models increasingly rely on website information, but face a critical limitation: context windows are too small to handle most websites in their entirety. Converting complex HTML pages with navigation, ads, and JavaScript into LLM-friendly plain text is both difficult and imprecise.

While websites serve both human readers and LLMs, the latter benefit from more concise, expert-level information gathered in a single, accessible location. This is particularly important for use cases like development environments, where LLMs need quick access to programming documentation and APIs.

We propose adding a /llms.txt markdown file to websites to provide LLM-friendly content. This file offers brief background information, guidance, and links to detailed markdown files.
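As a rough illustration of what such a file might contain, the sketch below renders a small /llms.txt document from Python data. The layout it produces (a title heading, a one-line summary in a blockquote, then headed lists of links to markdown files) is an assumption based on the description above, not a normative rendering of the proposal, and `build_llms_txt`, the project name, and the example.com URL are all hypothetical.

```python
def build_llms_txt(title, summary, sections):
    """Render a minimal llms.txt-style markdown document as a string.

    `sections` maps a heading to a list of (name, url, note) tuples.
    The layout is an assumed reading of the proposal, for illustration only.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            entry = f"- [{name}]({url})"
            if note:
                entry += f": {note}"
            lines.append(entry)
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site data, used only to show the shape of the output.
doc = build_llms_txt(
    "Example Project",
    "A hypothetical library used only to illustrate the layout.",
    {
        "Docs": [
            ("Quick start", "https://example.com/quickstart.md",
             "installation and first steps"),
        ],
    },
)
print(doc)
```

The resulting text would be saved at the site root as /llms.txt; keeping it short is the point, since the linked markdown files carry the detail.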

Screenshot of platform.openai.com/tokenizer

OpenAI’s large language models (sometimes referred to as GPTs) process text using tokens, which are common sequences of characters found in a set of text. The models learn the statistical relationships between these tokens and excel at producing the next token in a sequence.

You can think of tokens as the “letters” that make up the “words” and “sentences” that AI systems use to communicate.

A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).
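The rule of thumb can be turned into a quick estimator. This is only a sketch of the arithmetic: the function names are hypothetical, and a real tokenizer will give different counts, especially for code or non-English text.

```python
CHARS_PER_TOKEN = 4      # rule of thumb for common English text
WORDS_PER_TOKEN = 0.75   # roughly 3/4 of a word per token


def estimate_tokens(text):
    """Approximate token count from character length (not a real tokenizer)."""
    if not text:
        return 0
    return max(1, round(len(text) / CHARS_PER_TOKEN))


def tokens_to_words(tokens):
    """Approximate number of English words covered by a token count."""
    return tokens * WORDS_PER_TOKEN


sample = "Tokens are common sequences of characters found in text."
print(estimate_tokens(sample))   # rough estimate only
print(tokens_to_words(100))      # 75.0
```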

The process of breaking text down into tokens is called tokenization. This allows the AI to analyze and “digest” human language into a form it can understand. Tokens become the data used to train, improve, and run the AI systems.

Why Do Tokens Matter?

There are two main reasons tokens are important to understand:

  1. Token Limits: All LLMs have a maximum number of tokens they can handle per input or response. This limit ranges from a few thousand tokens for smaller or older models to hundreds of thousands or more for recent commercial ones. Exceeding the token limit can lead to errors, confusion, and poor-quality responses from the AI.
  2. Cost: Companies like OpenAI, Anthropic, Alphabet, and Microsoft charge based on token usage when people access their AI services. Pricing is typically quoted per 1,000 or per 1,000,000 tokens, so the more tokens fed into the system, the higher the cost of generating responses. Staying within a token budget helps control expenses.
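Both points reduce to simple arithmetic, sketched below. The sketch assumes a single limit shared by the prompt and the response, and separate per-1,000-token prices for input and output; the function names, limits, and prices are hypothetical and do not reflect any provider's actual figures.

```python
def fits_limit(prompt_tokens, max_output_tokens, context_limit):
    """True if the prompt plus the reserved response budget fits the limit."""
    return prompt_tokens + max_output_tokens <= context_limit


def estimate_cost(input_tokens, output_tokens,
                  input_price_per_1k, output_price_per_1k):
    """Cost in the same currency unit as the prices, billed per 1,000 tokens."""
    return ((input_tokens / 1000) * input_price_per_1k
            + (output_tokens / 1000) * output_price_per_1k)


# Hypothetical numbers, chosen only for illustration.
print(fits_limit(3500, 500, 4096))            # True
print(fits_limit(3800, 500, 4096))            # False
print(estimate_cost(2000, 1000, 0.5, 1.5))    # 2.5
```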

Strategies for Managing Tokens

Because tokens are central to how LLMs work, it’s important to learn strategies to make the most of them:

  • Keep prompts concise and focused on a single topic or question. Don’t overload the AI with tangents.
  • Break long conversations into shorter exchanges before hitting token limits.
  • Avoid huge blocks of text. Summarize previous parts of a chat before moving on.
  • Use a tokenizer tool to count tokens and estimate costs.
  • Experiment with different wording to express ideas in fewer tokens.
  • For complex requests, try a step-by-step approach vs. cramming everything into one prompt.
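One way to apply the advice about long conversations is to keep only the most recent messages that fit a token budget and set the older ones aside for summarizing. The sketch below is a minimal version of that idea; `trim_history` and `rough_token_count` are hypothetical helpers, and the default counter uses the 4-characters-per-token heuristic rather than a real tokenizer, so a proper counting function should be passed in where accuracy matters.

```python
def rough_token_count(text):
    """Heuristic: about 4 characters per token for English text."""
    return (len(text) + 3) // 4


def trim_history(messages, budget, count=rough_token_count):
    """Keep the most recent messages whose combined size fits the budget.

    Returns (kept, dropped); dropped messages are candidates for summarizing.
    """
    kept, used = [], 0
    for message in reversed(messages):
        cost = count(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    dropped = messages[: len(messages) - len(kept)]
    return kept, dropped


history = ["a" * 40, "b" * 40, "c" * 40]   # about 10 tokens each
kept, dropped = trim_history(history, budget=25)
print(len(kept), len(dropped))  # 2 1
```

Walking the history from newest to oldest means the latest context survives; the dropped messages can be condensed into a short summary and placed ahead of the kept ones.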