Foreword
Let me start by saying that I mean no offense to the creators of Large Language Models (LLMs), the tools for evaluating them, or the individuals and organizations who rank them and create leaderboards. As a relative newcomer to this ecosystem, I am indebted to your work and grateful that the way has been made easy for me to accomplish tasks that would otherwise be laborious.
Clearly, my next post will have to explore the various methods of model ranking, to get a little more detail on how the evals work.
Introduction
If you haven’t read my previous article, PrivateGPT for Book Summarization: Testing and Ranking Configuration Variables, you may find it beneficial to review, as it defines terms and explains how I arrived at my various practices and beliefs.
If you did read that article, then you will be aware that I’ve spent the past few months refining my process for summarizing books with Large Language Models (LLMs). I measured the impact of a series of configuration variables, including prompt templates, system prompts, and user prompts.
From that preliminary round of model rankings and configuration-variable data, I found mistral-7b-instruct-v0.2.Q8_0.gguf to produce the highest-quality bulleted notes, and I have been searching ever since for a model that beats it while still fitting on my 12GB 3060.
For this ranking, I’m using that base of knowledge to assess a variety of leading 7b models, this time with Ollama, which I find simpler to use and quite performant.
I chose the following models because they rank above Mistral 7b Instruct 0.2 on various leaderboards, or are self-proclaimed as the best 7b. (The chat templates tested are shown in parentheses.)
- openchat-3.5-0106.Q8_0.gguf (OpenChat)
- snorkel-mistral-pairrm-dpo.Q8_0.gguf (Mistral)
- dolphin-2.6-mistral-7b.Q8_0.gguf (Mistral)
- supermario-v2.Q8_0.gguf (ChatML)
- openhermes-2.5-mistral-7b.Q8_0.gguf (ChatML)
- openhermes-2.5-neural-chat-7b-v3-1-7b.Q8_0.gguf (ChatML)
- openhermes-2.5-neural-chat-v3-3-slerp.Q8_0.gguf (ChatML)
- WestLake-7B-v2-Q8_0.gguf (ChatML, Mistral)
- MBX-7B-v3-DPO.q8_0.gguf (ChatML, Mistral)
- neuralbeagle14-7b.q8_0.gguf (ChatML, Mistral)
- omnibeagle-7b-q8_0.gguf (ChatML, Mistral)
Most of these models are Mistral-derived, so where I wasn’t getting the desired results, I also tested the Mistral template, even when ChatML is listed as the preferred input.
Bullet Point Notes With Headings and Terms in Bold
Write comprehensive bulleted notes summarizing the following text, with headings, terms, and key concepts in bold.\n\nTEXT:
While GPT3.5 isn’t my personal baseline, it is something of an industry standard, and I would expect it to produce better results than most 7b Q8 GGUFs.
While there are no key concepts or terms in bold, the headings are in bold, and overall this is quite easy to read compared to blocks of paragraphs. Whether or not terms end up in bold may also depend on the input text itself, whereas a bullet point summary should always include bolded headings.
I’m Looking for Models That Produce Notes:
- faster
- with more detail, less filler
- with comparable detail at longer context (I’m currently stretching these capabilities at around 2.5k tokens of context)
I see this as a fundamental task for any instruct model. Ideally, developers will train their models to generate these types of ideal bulleted notes. I have tons of data, with some books already processed, but it’s relatively simple to generate these notes for a book (using Mistral 7b Instruct 0.2, with the text semantically chunked by hand into parts below 2.5k tokens each).
If it’s a 300-600 page book, the whole job can usually be done in a single day, including pre- and post-processing.
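For the curious, here is a minimal sketch of that generation loop in Python, assuming a local Ollama server on its default port (11434). The chunking below is a crude character-based stand-in for my by-hand semantic chunking, and the model name and input file are placeholders.

import requests

# The bulleted-notes prompt used throughout this article.
PROMPT = ("Write comprehensive bulleted notes summarizing the following text, "
          "with headings, terms, and key concepts in bold.\n\nTEXT: ")

def summarize(chunk, model="mistral-notes"):
    # "mistral-notes" stands in for whatever name you gave `ollama create`.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT + chunk, "stream": False},
    )
    resp.raise_for_status()
    return resp.json()["response"]

def chunk_text(text, max_chars=10_000):
    # Crude stand-in for semantic chunking: ~2.5k tokens is roughly 10k characters.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

chapter = open("chapter.txt").read()  # hypothetical input file
notes = [summarize(c) for c in chunk_text(chapter)]
print("\n\n".join(notes))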
Eventually, I might experiment with some fine-tuning in an attempt to improve their capacities myself.
The Rankings
Previously, I tried to give each model a numerical score, but it’s really hard to score these summaries consistently. In the future, I think I’ll try to get an LLM to rank them for me. This time, I’ll just comment on where each model falls short and what I like, without assigning a numerical score.
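If I do eventually hand the ranking over to an LLM, a judging pass might look something like this rough sketch, again against a local Ollama server; the judge model and criteria here are placeholder assumptions, not a settled methodology.

import requests

JUDGE_PROMPT = ("On a scale of 1-10, rate the following bulleted-notes summary for "
                "structure, bolded headings, and detail versus filler. "
                "Reply with the number only.\n\nSUMMARY:\n")

def judge(summary, judge_model="mistral-notes"):
    # judge_model is a placeholder name; any capable local model could grade.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": judge_model, "prompt": JUDGE_PROMPT + summary, "stream": False},
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()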
I tested each of the following models on a single book chapter, divided into 6 chunks of 1,900-3,000 tokens each. I’ll share a representative example output from each, and the full data will be available on GitHub, as always.
Mistral 7b Instruct 0.2 Q8 GGUF
I’m sure you realize by now that, in my opinion, Mistral has the 7b to beat.
Modelfile
Ollama has a feature where you put the model location, template, and parameters into a Modelfile, which it uses to save a copy of your LLM with your specified configuration. This makes it easy to demo various models without always fussing with parameters.
I’ve kept the parameters the same for all models, changing only the chat template, and I’ll share the template I’m using for each so you can see precisely how it’s applied. Let me know if I’d get better results from any of the following models with a differently configured Modelfile.
TEMPLATE """
<s></s>[INST] {{ .Prompt }} [/INST]
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
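Note that the Modelfile snippets in this post leave out the FROM line pointing at the model weights, which a complete Modelfile requires, e.g. (the local path is a placeholder):

FROM ./mistral-7b-instruct-v0.2.Q8_0.gguf

With that added above the TEMPLATE, you register and run the configured model with (the name is a placeholder):

ollama create mistral-notes -f Modelfile
ollama run mistral-notes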
Mistral 7b Instruct v0.2 Result
I won’t say that Mistral does it perfectly every single time, but more often than not, this is my result. And if you look back at the GPT3.5 response, you might agree this is better.
OpenChat 3.5 0106 Q8 GGUF
I was pleasantly surprised by OpenChat’s 0106. Here is a model that claims to be the best 7b, and it is at least competitive with Mistral 7b.
Modelfile
TEMPLATE """
GPT4 Correct User: {{ .Prompt }}<|end_of_turn|>GPT4 Correct Assistant:
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
OpenChat 3.5 0106 Result
In this small sample, it produced bold headings 4/6 times. Later, I will review it, along with any other top contenders, using a more detailed analysis.
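Tallies like that 4/6 are easy to automate; here is a rough sketch that checks each summary for a bolded heading, under the assumption that the models emit Markdown-style **double-asterisk** bold.

import re

def has_bold_heading(summary):
    # True if any line consists entirely of a **bolded** heading.
    return any(re.match(r"^\s*\*\*.+\*\*\s*$", line) for line in summary.splitlines())

# With `notes` from the earlier pipeline sketch:
# hits = sum(has_bold_heading(n) for n in notes)
# print(f"{hits}/{len(notes)} summaries have bold headings")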
Snorkel Mistral Pairrm DPO Q8 GGUF
Obviously, I’m biased here, as Snorkel was trained on Mistral 7b Instruct 0.2. Regardless, I am cautiously optimistic and look forward to more releases from Snorkel.ai.
Modelfile
TEMPLATE """
<s></s>[INST] {{ .Prompt }} [/INST]
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
Snorkel Mistral Pairrm DPO Result
4/6 of these summaries are spot on, but the others contain irregularities such as overly long lists of key terms and headings, rather than bolding them inline as part of the summary.
Dolphin 2.6 Mistral 7B Q8 GGUF
Here is another Mistral derivative that’s well regarded.
Modelfile
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
Dolphin 2.6 Mistral 7B Result
This is another decent model that’s almost as good as Mistral 7b Instruct 0.2. Three out of six summaries had proper format and bold headings, another had good format with no bold, but two of the six were bad form all around.
OpenHermes 2.5 Mistral-7B Q8 GGUF
This model is quite popular, both on leaderboards and among “the people” in unaffiliated Discord chats. I want it to be a leader in this ranking, but it’s just not.
Modelfile
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
OpenHermes 2.5 Mistral Result
3/6 results produced proper structure but no bold text. One had both structure and bold text. The other two had big blocks of text and poor structure.
OpenHermes 2.5 Neural Chat 7b v3.1 7B Q8 GGUF
I also tried a few high-ranking derivatives of OpenHermes 2.5 Mistral to see if I could get better results. Unfortunately, that was not the case.
Modelfile
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
OpenHermes 2.5 Neural Chat 7b v3.1 Result
None of these results were desirable.
OpenHermes 2.5 Neural-Chat v3.3 Slerp Q8 GGUF
Whatever they did, these derivatives did not improve upon the original.
Modelfile
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
OpenHermes 2.5 neural-chat v3.3 Slerp Result
It’s just getting worse with each new version!
Super Mario V2 Q8
I wasn’t expecting much from Mario, but it shows some promise. Meanwhile, v3 and v4 are available, but I haven’t found GGUFs for those yet.
Modelfile
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
Super Mario V2 Result
Its first result was deceptively good, but each of the following summaries deviated from the desired pattern. I’ll be on the lookout for GGUFs of the newer releases. As you can see here, we got blocks of paragraphs with an initial bolded heading, which is not really what I asked for.
Honorable Mentions
- omnibeagle-7b (ChatML) - This one actually produces a decent format, but no bolded text.
- neuralbeagle14-7b (ChatML, Mistral) - Works better with the Mistral template. “OK” results, but too much confusion around prompt templates for my liking.
- WestLake-7B-v2 (ChatML) - I’ve seen worse.
- MBX-7B-v3-DPO (ChatML) - No consistency in format.
Conclusion
I wish I had better news to share. My ideal headline would be that there is an abundance of leading models producing quality output for comprehensive bulleted note summaries, and that it’s hard to choose among them. Unfortunately, that is not the case.
Maybe these models outperform Mistral 0.2 in full, unquantized form and only trail it in GGUF format? I think it’s quite likely that none of our existing evals target this type of output, but I would certainly argue it’s a task that any leading 7b GGUF model should be able to manage.
Another thing to consider is that Mistral 7b Instruct v0.2 came out soon after Mixtral, amidst all of Mixtral’s fanfare, and I think the v0.2 release slipped under the radar. In fact, many of the “leading” models I’ve looked at are based on Mistral 0.1.
Maybe things will change, and the world will realize that its best models still can’t top Mistral? Then again, maybe all those models really are good at the other tasks I’m not asking them for.
I’m Willing to Help, and I’m Willing to be Proven Wrong
I have data, I have a pipeline, and I have an endless need to create bulleted note summaries. If you want to work with me, please reach out.
You are also welcome to check out my GitHub, examine the data, and try your own version of this experiment. I’m happy to be proven wrong.