The Shape of Language — Voidwolf Blog

This week I set out to make a language model ten times smaller.

Not its output. The model itself. Fewer parameters, less memory, less compute, same quality of answer. I'm a solo engineer with a single graphics card, not a lab with a data centre, so the only realistic way I was going to make progress on this was to try something the big labs weren't trying. That usually means going sideways into the math.

I want to write about what I tried, why it seemed like a reasonable idea, and why it didn't work. I've put the technical version on GitHub for anyone who wants the full nerdy story. This one is the human version.

Why smaller matters

The language models most people use today are enormous. Training the biggest ones costs tens or hundreds of millions of dollars and running them takes entire data centres. This is fine if you are Google or OpenAI. It is less fine if you are a small business, an underfunded researcher, or someone who doesn't want every private conversation floating through someone else's servers.

If you could shrink these models by an order of magnitude - ten times smaller at the same quality - a lot of things start to open up. A genuinely useful model runs on a normal computer. Individuals can afford to train their own. You can put a capable model on a phone, a robot, a satellite. The entire shape of who gets to use this technology changes. AI gets taken out of the hands of corporations and moves into the hands of people who are more focused on the good of society than profit.

There are plenty of people working on shrinking things incrementally. There are fewer people asking whether we could be doing it in a fundamentally different way. That was the question I wanted to chase.

Language is shaped like a tree

Here is the idea that got me interested.

If you think about how words relate to each other, they have a shape. "Animal" contains "mammal" contains "dog" contains "golden retriever". "Food" contains "fruit" contains "apple" contains "Granny Smith". Language is full of these branching hierarchies - general things at the top, specific things at the bottom, like a tree.

A computer represents words as points in a kind of invisible space. Most language models use what mathematicians call flat space - the same geometry you learned about in school, where parallel lines stay parallel forever and a circle's area is always π times the radius squared. Flat space is great for most things. It's not great for trees or hierarchies.

Here's why. Imagine trying to draw a family tree on a normal sheet of paper. At the top, one couple. Below them, three kids. Below them, nine grandkids. Nine becomes twenty-seven, then eighty-one, then two hundred and forty-three. By the time you're ten generations deep you have nearly sixty thousand descendants and you ran off the edge of the paper several generations ago. Trees grow exponentially. Flat paper just doesn't have room.

There's another kind of geometry - mathematicians call it hyperbolic - where the space itself grows exponentially as you move outward from the centre. It is the opposite of a sphere. A Pringle chip has a hyperbolic-shaped surface. So do M. C. Escher's famous woodcuts of angels and devils packed into a circle and growing smaller as they spiral toward the edge. In hyperbolic space, a circle's area grows exponentially with its radius, not just quadratically.

Failure

What that means in practice is that hyperbolic space is paper that was made for trees. You can draw your family tree at full size, all ten generations, all sixty thousand descendants, and everyone fits comfortably with room to spare.

The hypothesis that's been floating around for a few years: if language is tree-shaped, and hyperbolic space is tree-shaped, maybe language models that live in hyperbolic space are more efficient. Less wasted room. Better geometry-of-meaning. Maybe even that tenfold efficiency.

What I tried

Specifically, I wanted to test something that nobody else had, as far as I could tell.

The part of a language model that decides which words matter and map to each other is called attention. When the model reads "the dog chased the ball and it bounced away", attention is the thing that figures out that "it" probably refers to the ball, not the dog. Modern models have many attention mechanisms running in parallel - called "heads" - each specialising in different kinds of relationships. One head might track grammar. Another might track which characters are in a story. Another might track big thematic connections across paragraphs.

The existing work on hyperbolic language models gave every attention head the same fixed amount of curviness regardless of where it sat on the curve. My idea was: what if we let each head pick its own? Maybe the grammar head wants nearly flat space. Maybe the long-range thematic head wants very curved space. Let them figure it out.

I built a tiny model. I wrote down in advance exactly what I was going to test, so I couldn't fool myself later by quietly moving the goalposts. I ran careful experiments on prose from Wikipedia and then on Python source code, because code has an even more tree-like structure than English does. I ran three increasingly fair versions of the same comparison, each one fixing a flaw I'd found in the last.

It didn't work

The flat, ordinary model won. Every time.

On Wikipedia prose, my hyperbolic version closed some of the gap when I tuned it carefully, but it never caught up. On Python code - the most tree-shaped data I had - the gap between hyperbolic and flat actually got wider, not narrower. Which was the exact opposite of what the theory predicted.

I did find a few interesting things along the way. When I let each attention head choose its own curvature, a strange pattern showed up: the second-to-last layer of the model always wanted the most curved space, and the first layer always wanted the flattest. This happened in three separate training runs with different settings. As far as I can tell, nobody else could have noticed this before, because nobody else had built a model where each head was allowed to pick its own curvature.

I also found that using more precise math (double precision floating point, for anyone curious) actually made the hyperbolic model worse, not better. The imprecision of the cheaper math had been accidentally preventing the model from memorising things it shouldn't. That's a counterintuitive little fact that someone else might find useful one day.

But the headline didn't change. My approach was not the tenfold breakthrough.

Sitting with a negative result

There is a particular feeling to putting lots of work into something, running it properly, and having it simply not be true.

It is not the same as failing. Failing implies you made a mistake. This is more like asking a careful question and getting an answer you didn't want. The question was: is this the way to much more efficient language models? The answer was: no, not this way. Not at this scale.

Part of me is disappointed. A larger part is genuinely relieved, because the alternative was to keep pouring effort into something that wasn't going anywhere. I know now. I can stop and my GPU get's to take a well deserved break.

There's also the slightly humbling realisation that while I was working on this, a team at Yale had been doing something very similar at much larger scale and with much more infrastructure. They got further than me, published a paper at one of the big AI conferences, released their code. The specific gap I thought I was working in turned out to be less of a frontier and more of a neighbourhood with a lot of people already living in it.

That's a normal part of research. It does mean I'm better off picking a different street next time.

Why I'm glad I did it

Despite all of that, I'm glad I did this.

I learned an enormous amount about how these models actually work under the hood, in a way I couldn't have learned from reading papers. I have a map of a territory now - what hyperbolic approaches can do, where they fail, what the current state of the art looks like, what the open problems are that nobody has touched yet. If I want to come back to this in a year when the field settles, I'll know exactly where to pick it up.

The work is on GitHub. If anyone else is curious about this direction and doesn't want to spend ages scouting, they can start from my notes instead of from scratch. That's worth something on its own.

And the bigger question - whether we can make language models an order of magnitude more efficient - is still open. I still believe it's possible. I just believe it with a little more humility about which specific path will get us there.

There's a different direction forming in my head already. I'll write about it when I have something to show.

The full technical write-up and all the code is on GitHub for anyone who wants the nerdy version.