On The Curse of Dimensionality

Or what the difficulties of splitting a bill can teach us about working with LLMs

2024.11.17

LXXIV

The trick is to just make the problem simpler… but not sooooo simple that you miss the point.

Curse of Dimensionality

All things considered, it is easier to solve problems with fewer variables than more variables.

If you are at dinner with friends and you pay the bill to collect credit card points, you (or someone) will need to figure out how much everyone owes you.

If you all ordered the same things, there are no variables. Everyone owes the total cost divided by the number of people. If you all ordered different things, it gets a little bit more complicated. Now there’s a variable–what percentage of the total does each person owe?

And now, if you want to prorate who owes what based on the proportion of appetizers eaten, it gets even worse–that adds a new variable for each appetizer. And then, god forbid, if you want everyone to pay a different proportion of the tip based on comparative satisfaction, it gets even more complicated.

This notion–that the more variables are involved in a question, the more challenging it is–generalizes pretty well across life. Formally, this is known as the curse of dimensionality. As you add dimensions to a problem, the space of all possibilities grows exponentially. The data you have to base your models of that space on never grows as fast as the possibilities do.

In other words, as we add more variables to a problem, the possible solutions grow faster than the information we have to analyze them.
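To make that concrete, here’s a tiny Python sketch. The setup (each variable can only take one of 10 values) is just an illustrative assumption, not anything about an actual dinner bill:

```python
# A rough sketch of how quickly the space of possibilities grows.
# Assume each "variable" in the bill-splitting problem can take one of
# just 10 values (e.g., a share rounded to the nearest 10%). The number
# of possible configurations is then 10^d, where d is the number of variables.

for d in [1, 2, 5, 10, 20]:
    possibilities = 10 ** d
    print(f"{d:>2} variables -> {possibilities:,} possible configurations")

# 1 variable  ->  10
# 20 variables -> 100,000,000,000,000,000,000
# Meanwhile, the number of dinners you actually observe grows roughly
# linearly with time, nowhere near fast enough to cover that space.
```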

So, perhaps a working solution is to manage the number of variables you’re dealing with when you can. This seems to work in both sales and AI.

BAMFAM

While I’m still no sales expert, one of the simplest, highest-leverage habits I’ve picked up so far is BAMFAM–Book A Meeting From A Meeting.

This just means that if I plan on continuing the sales process with a prospect, I’ll make sure that before our meeting is over, I pick a time with them for our next meeting. I couldn’t tell you where I learned this because it was probably in every sales book I’ve ever read.*

The idea here is to keep the follow-up with the prospect as simple as possible. You’re likely already asking them to do something–use your platform, intro you to someone else on the team, review a contract, or maybe pay you. Why would you add any more complexity by also making them book a meeting with you via a back-and-forth email convo?

To come at it from another angle, you’re reducing the dimensionality of the problem for the prospect. You want them to do the thing that will move the sale forward, not worry about logistics. Having a meeting already on the calendar somewhat locks in one dimension of the problem (am I going to chat with Noah again?) and lets them focus on the more important thing—is it worth it for them to purchase your product or service?

Of course, I’m still learning to extend that to the whole sales process and remove other unneeded variables. 

*If I recall correctly, I also learned it in an unnamed dating forum I embarrassingly read in HS–book a date from a date. I haven’t followed it in that regard, however.

LLMs

I don’t know a whole lot about how LLMs actually work. Still, two weeks ago, I started fine-tuning some for BirdDog.

Fine-tuning means you take a language model, give it a bunch of example inputs and outputs, and “train” it to get the “answer” to new, similar inputs right. You can do this to help a smaller model perform as well as or better than a larger model on a specific type of question or task. And the smaller the model, the less expensive it is to run.

Quite simply, for our use case, we wanted to know whether a piece of context could be used to answer a given question. Here’s a simple example using animals:

Context: Felines eat mice, bugs, carrion, fish, and quite a wide range of other foods. However, foods such as chocolate, onions, and grapes are dangerous and sometimes deadly for cats to consume.

Question: Can cats eat chocolate?

Answer: True

This is a little tricky, because we’re not asking if cats can eat chocolate. We’re asking if the question can be answered by the context. 
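If it helps to see the shape of that, here’s a minimal sketch of what this kind of fine-tuning setup might look like with Hugging Face transformers. The base model, the way the question is paired with the context, and the hyperparameters are all illustrative assumptions, not our exact setup:

```python
# Minimal sketch: fine-tune a small classifier on "does the context answer the question?"
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Toy examples in the same shape as the cat/chocolate example above.
# label 1 = the context answers the question ("True"), 0 = it does not.
examples = [
    {"context": "Felines eat many foods; chocolate is dangerous for cats.",
     "question": "Can cats eat chocolate?", "label": 1},
    {"context": "Felines eat many foods; chocolate is dangerous for cats.",
     "question": "Can cats swim?", "label": 0},
    {"context": "Parrots can mimic human speech after repeated exposure.",
     "question": "Can parrots talk?", "label": 1},
    {"context": "Parrots can mimic human speech after repeated exposure.",
     "question": "Do parrots eat chocolate?", "label": 0},
    # ... in practice, ~10,000 examples like these
]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Pair the question with the context so the model sees both at once.
    return tokenizer(batch["question"], batch["context"],
                     truncation=True, padding="max_length", max_length=256)

dataset = Dataset.from_list(examples).map(tokenize, batched=True)
splits = dataset.train_test_split(test_size=0.15)  # hold some questions out

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
)
trainer.train()
```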

With a total data set of 10,000 examples, we trained the model on 8,500 samples* and got it to 98% accuracy and an 89% F1 score** on the 1,500 questions we kept outside of the training set. This was super exciting, because before we trained it, the model’s baseline performance was 44% accuracy and a 22% F1 score.
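For reference, the scoring itself is simple; something like this scikit-learn sketch (with made-up predictions) is all it takes to get accuracy and F1 on the held-out questions:

```python
# Score predictions on the held-out set (the arrays here are made up).
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]   # labels for the held-out questions
y_pred = [1, 0, 1, 0, 0, 1]   # model predictions

print("accuracy:", accuracy_score(y_true, y_pred))  # fraction correct
print("F1:", f1_score(y_true, y_pred))              # balances precision and recall
```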

However, we found that the model performed terribly in the “wild,” to the point of being useless. 

We realized that whenever the animal in the question had not been seen in the training set, the model was totally incompetent. So, a question like this might fail:

Context: Non-athletic canines have been known to swim for 10 minutes, while trained canines can swim for closer to 30 minutes.

Question: Can dogs swim?

While we thought we were training the model to determine if the context was useful for answering a question, we were training it to determine if the context was useful for answering a question about a specific entity.

In other words, the question, the context, the entity in the question, and the entity in the context were all variables. And because the model has seen both “feline” and “cat,” it knows the two are the same, but it might not know that about “canine” and “dog.”***

The solution? I haven’t done it yet, so I’ll report back in a couple weeks, but I have been told that if I break the task into two parts, we’ll get much better results:

  1. Mask the entities (cats, dogs, feline, canine) and ask if the context answers the question. This reduces it to the two variables I wanted to train it on—question and context.

  2. Unmask the entities and determine if the two pieces of text are about the same entities–we might not even need a language model for this part…
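To make step 1 concrete, here’s a rough sketch of what the masking could look like. The placeholder token and the naive string replacement are assumptions for illustration; a real version would probably lean on proper entity recognition:

```python
# Step 1 sketch: mask entity mentions before asking whether the context
# answers the question, so the model only has to reason about question + context.

def mask_entities(text: str, entities: list[str], mask: str = "[ENTITY]") -> str:
    """Replace every known entity mention with a placeholder token."""
    masked = text
    for entity in entities:
        masked = masked.replace(entity, mask)
    return masked

entities = ["cats", "dogs", "felines", "canines"]

context = "Non-athletic canines have been known to swim for 10 minutes."
question = "Can dogs swim?"

print(mask_entities(context.lower(), entities))
# "non-athletic [ENTITY] have been known to swim for 10 minutes."
print(mask_entities(question.lower(), entities))
# "can [ENTITY] swim?"

# Step 2 (unmasked): check whether the masked-out entities refer to the same
# thing, e.g., via a synonym table or embedding similarity -- which may not
# need a language model at all.
```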

In this way, we descope the variables in the problem so that the model doesn’t have as much to deal with.

*7,000 samples for training & 1,500 for validation

**A metric that balances false positives and false negatives (the harmonic mean of precision and recall)

***A base model may already be able to understand the similarity between a canine and a dog, but not between the entities we actually cared about

Naive Worldview

While reducing variables makes a problem easier to work with, when you are modeling the world, you have to be careful not to oversimplify it.

Understanding a beehive via the bee-havior of each of the individuals is a fool’s errand. You typically look at higher-level variables, such as the shape the drawn-out comb makes or the collective aggression level of the bees.

If a worldview is a function with variables, then that is the constant challenge, I suppose: have a worldview that is as simple as possible while explaining as much of the world as possible. A worldview with one variable that only explains 1% of your experience may not be so useful, just as a model with 100 trillion variables that explains 90% of things may not be so useful either.

Hmm, I was awfully quiet about efficiency this time. Already onto dimensionality, are we? We’ll see if I can’t succinctly connect them at some point in the future…

Live Deeply,