Transformers
The Transformer architecture, introduced in a 2017 paper titled “Attention Is All You Need,” is the fundamental innovation behind the current AI boom. Before transformers, models processed text sequentially, like reading a book one word at a time. This was slow and inefficient; the model would often “forget” the beginning of a long passage by the time it reached the end.
Transformers changed the game by processing every word in a text at the same time. The secret to this is the self-attention mechanism.
Analogy: A Courtroom of Gossips
Imagine a courtroom where a key piece of testimony is being read: “The defendant, Mr. Smith, put the document in the briefcase, and then he left with it.”
Now, imagine the courtroom is filled with gossips. Each gossip is assigned one word from that sentence.
- The “it” gossip needs to figure out what “it” refers to. They shout, “Who should I pay attention to?”
- The “briefcase” gossip hears this, recognizes its importance, and shouts back, “Me! Pay attention to me!”
- The “document” gossip also shouts, “Me too! I’m important!”
- The “defendant” gossip stays quiet, knowing “it” is less likely to refer to a person in this context.
After all this shouting, the “it” gossip has a clear picture: “it” most likely refers to the “briefcase” or the “document.” They create a map of these relationships.
This is what the self-attention mechanism does. For every single word (or token) in the input, the model calculates “attention scores” to determine how related it is to every other word. It’s a network of connections where the model learns, from massive amounts of data, which words tend to give context to others. This is done in parallel across many “attention heads,” each looking for different types of relationships (e.g., one head might track pronouns, another might track causal links).
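For readers who want to see the mechanics, here is a minimal sketch of one attention head in plain NumPy, using random toy values rather than any real model’s weights: each token’s query is compared against every other token’s key, the scores are normalized into attention weights, and each token’s output is a weighted blend of all the tokens’ values. Real transformers run many such heads in parallel across many layers, with weights learned from data.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """One attention head: every token scores its relevance to every other token."""
    Q = X @ Wq                                   # queries: what each token is looking for
    K = X @ Wk                                   # keys: what each token offers
    V = X @ Wv                                   # values: the content each token carries
    scores = Q @ K.T / np.sqrt(Q.shape[-1])      # pairwise compatibility, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V, weights                  # output = weighted mix of all tokens' values

# Toy example: 4 "tokens" with 8-dimensional embeddings and one 4-dimensional head.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
output, attention = self_attention(X, Wq, Wk, Wv)
print(attention.round(2))   # row i is the "map of relationships": how much token i attends to each token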
The Legal and Technical Flaws
This architecture is powerful, but it also introduces specific, legally relevant failure modes that go beyond a simple “black box” explanation.
- Over-Attention Creates Regurgitation: The attention mechanism can “overfit” on unique or rare patterns in the training data. If a specific poem or legal disclaimer appears in the training set, the model’s attention heads can learn to link the words in that phrase so strongly that prompting it with the first few words will reproduce the rest verbatim. This is not a search query; it is a learned, high-probability pattern of attention, and it is the technical mechanism behind much of the verbatim reproduction at issue in copyright litigation. (A sketch of how such memorization can be probed appears just after this list.)
- Context Is Not Understanding: The model doesn’t “understand” that a briefcase is a physical object. It only knows, based on statistical analysis of trillions of words, that the token “briefcase” has a high attention score when linked to the token “it” in this structure. This is why models can be so easily tricked. They can be led down a path of statistically plausible but factually incorrect or nonsensical statements, a phenomenon known as hallucination.
- Attention Can Be Biased: The attention scores are learned from the training data. If the data repeatedly associates the word “CEO” with male pronouns, the model will learn to pay more attention to male pronouns when it sees “CEO.” The attention mechanism doesn’t just learn grammar; it learns, quantifies, and reinforces the biases present in its training data.
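The regurgitation risk in the first item above can be probed directly. The sketch below assumes the Hugging Face transformers library and uses the openly available GPT-2 model and a famous public-domain sentence purely as stand-ins for the model and passage at issue: it feeds the model the opening words and measures how much of the known continuation comes back verbatim under greedy decoding. Whether any particular passage was actually memorized depends on the model and its training data, so treat this as a screening technique, not proof.

```python
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" and this public-domain sentence are illustrative stand-ins; swap in the
# model and the text you suspect was memorized.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "We hold these truths to be self-evident, that all men are"
expected = " created equal, that they are endowed by their Creator with certain unalienable Rights"

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=False,                      # greedy decoding: follow the single highest-probability path
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse end-of-sequence
)
new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
continuation = tokenizer.decode(new_tokens, skip_special_tokens=True)

# A long character-for-character overlap suggests the sequence was memorized, not composed.
overlap = len(os.path.commonprefix([continuation, expected]))
print(repr(continuation))
print(f"verbatim overlap with the source passage: {overlap} characters")
```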
Understanding the transformer architecture and its attention mechanism allows a litigator to ask much more specific questions during discovery. Instead of asking “Why did your AI say this?” you can ask, “Can you provide the attention scores for the final layer when the model produced this output? Which attention heads were most active, and can you trace their behavior back to specific training data?” This moves the conversation from abstract capabilities to concrete, evidence-based technical details.
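Attention scores are not hypothetical artifacts; they are ordinary tensors computed every time the model runs, and for open-weights models they can be extracted with a few lines of code. The sketch below, again assuming the Hugging Face transformers library and using bert-base-uncased as a stand-in, pulls the final-layer attention for the courtroom sentence and lists the tokens that “it” attends to most; applied to contrasting prompts (for example, “CEO … he” versus “CEO … she”), the same technique is one way to probe the bias pattern described above. Which layers and heads carry meaningful signal varies by model, and special tokens often absorb much of the weight, so this is a starting point for an expert analysis rather than a conclusion.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# bert-base-uncased is a stand-in for whichever model is actually at issue.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

sentence = "The defendant, Mr. Smith, put the document in the briefcase, and then he left with it."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions holds one tensor per layer, shaped (batch, heads, seq_len, seq_len).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
final_layer = outputs.attentions[-1][0]     # final layer, first (and only) sentence in the batch
it_position = tokens.index("it")

# Average over heads: where does the token "it" direct its attention in the final layer?
scores = final_layer[:, it_position, :].mean(dim=0)
top = sorted(zip(tokens, scores.tolist()), key=lambda pair: -pair[1])[:5]
for token, score in top:
    print(f"{token:12s} {score:.3f}")
```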