Foreword
Let me start by saying that I mean no offense to the creators of Large Language Models (LLMs), the tools for evaluating them, or the individuals and organizations who rank them and create leaderboards. As a relative newcomer to this ecosystem, I am indebted to your work and grateful that the way has been made easy for me to accomplish tasks that would otherwise be laborious.
Clearly, my next post will have to explore the various methods of model ranking, to get a little more detail on how the evals work.
Introduction
If you haven’t read my previous article, PrivateGPT for Book Summarization: Testing and Ranking Configuration Variables, you may find it beneficial to review, as it defines terms and explains how I arrived at my various practices and beliefs.
If you did read that article, then you will be aware that I’ve spent the past few months refining my process for summarizing books with Large Language Models (LLMs). I measured the impact of a series of configuration variables, including prompt templates, system prompts, and user prompts.
From that preliminary round of model rankings and configuration-variable data, I found mistral-7b-instruct-v0.2.Q8_0.gguf to produce the highest-quality bulleted notes, and I have been searching ever since for a model that beats it while still fitting on my 12GB 3060.
For this ranking, I’m using that base of knowledge to assess a variety of leading 7b models, this time with Ollama, which I find simpler to use and quite performant.
I chose the following models because they rank above Mistral 7b Instruct 0.2 on various leaderboards, or are self-proclaimed as the best 7b. (The chat templates tested are shown in parentheses.)
- openchat-3.5-0106.Q8_0.gguf (OpenChat)
- snorkel-mistral-pairrm-dpo.Q8_0.gguf (Mistral)
- dolphin-2.6-mistral-7b.Q8_0.gguf (Mistral)
- supermario-v2.Q8_0.gguf (ChatML)
- openhermes-2.5-mistral-7b.Q8_0.gguf (ChatML)
- openhermes-2.5-neural-chat-7b-v3-1-7b.Q8_0.gguf (ChatML)
- openhermes-2.5-neural-chat-v3-3-slerp.Q8_0.gguf (ChatML)
- WestLake-7B-v2-Q8_0.gguf (ChatML, Mistral)
- MBX-7B-v3-DPO.q8_0.gguf (ChatML, Mistral)
- neuralbeagle14-7b.q8_0.gguf (ChatML, Mistral)
- omnibeagle-7b-q8_0.gguf (ChatML, Mistral)
Most of these models are Mistral-derived, so where I wasn’t getting the desired results, I also tested the Mistral template, even when ChatML is listed as the preferred input.
Bullet Point Notes With Headings and Terms in Bold
Write comprehensive bulleted notes summarizing the following text, with headings, terms, and key concepts in bold.\n\nTEXT:
While GPT3.5 isn’t my personal baseline, it is something of an industry standard, and I would expect it to produce better results than most 7b Q8 GGUFs.
While there are no key concepts or terms in bold, the headings are in bold, and overall this is quite easy to read compared to blocks of paragraphs. Whether or not terms end up in bold may also depend on the input text itself, whereas a bullet point summary should always include bolded headings.
I’m Looking for Models That Produce Notes:
- faster
- with more detail, less filler
- with comparable detail at longer context (I’m currently stretching these capabilities at around 2.5k tokens of context)
I see this as a fundamental task for any instruct model. Ideally, developers will train their models to generate these types of ideal bulleted notes. I have tons of data, with some books already processed, but it’s relatively simple to generate these notes for a book (using Mistral 7b Instruct 0.2, with the text semantically chunked by hand into parts below 2.5k tokens each).
If it’s a 300-600 page book, the whole job can usually be done in a single day, including pre- and post-processing.
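For the curious, here is a minimal sketch of that generation loop in Python, assuming a local Ollama server on its default port (11434). The chunking below is a crude character-based stand-in for my by-hand semantic chunking, and the model name and input file are placeholders.

import requests

# The bulleted-notes prompt used throughout this article.
PROMPT = ("Write comprehensive bulleted notes summarizing the following text, "
          "with headings, terms, and key concepts in bold.\n\nTEXT: ")

def summarize(chunk, model="mistral-notes"):
    # "mistral-notes" stands in for whatever name you gave `ollama create`.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT + chunk, "stream": False},
    )
    resp.raise_for_status()
    return resp.json()["response"]

def chunk_text(text, max_chars=10_000):
    # Crude stand-in for semantic chunking: ~2.5k tokens is roughly 10k characters.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

chapter = open("chapter.txt").read()  # hypothetical input file
notes = [summarize(c) for c in chunk_text(chapter)]
print("\n\n".join(notes))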
Eventually, I might experiment with some fine-tuning in an attempt to improve their capacities myself.
The Rankings
Previously, I tried to give each model a numerical score, but it’s really hard to score these summaries consistently. In the future, I think I’ll try to get an LLM to rank them for me. This time, I’ll just comment on where each model falls short and what I like, without assigning a numerical score.
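If I do eventually hand the ranking over to an LLM, a judging pass might look something like this rough sketch, again against a local Ollama server; the judge model and criteria here are placeholder assumptions, not a settled methodology.

import requests

JUDGE_PROMPT = ("On a scale of 1-10, rate the following bulleted-notes summary for "
                "structure, bolded headings, and detail versus filler. "
                "Reply with the number only.\n\nSUMMARY:\n")

def judge(summary, judge_model="mistral-notes"):
    # judge_model is a placeholder name; any capable local model could grade.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": judge_model, "prompt": JUDGE_PROMPT + summary, "stream": False},
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()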
I tested each of the following models on a single book chapter, divided into 6 chunks of 1,900-3,000 tokens each. I’ll share a representative example output from each, and the full data will be available on GitHub, as always.
Mistral 7b Instruct 0.2 Q8 GGUF
I’m sure you realize by now that, in my opinion, Mistral has the 7b to beat.
Modelfile
Ollama has a feature where you put the model location, template, and parameters into a Modelfile, which it uses to save a copy of your LLM with your specified configuration. This makes it easy to demo various models without always fussing with parameters.
I’ve kept the parameters the same for all models, changing only the chat template, and I’ll share the template I’m using for each so you can see precisely how it’s applied. Let me know if I’d get better results from any of the following models with a differently configured Modelfile.
TEMPLATE """
<s></s>[INST] {{ .Prompt }} [/INST]
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
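Note that the Modelfile snippets in this post leave out the FROM line pointing at the model weights, which a complete Modelfile requires, e.g. (the local path is a placeholder):

FROM ./mistral-7b-instruct-v0.2.Q8_0.gguf

With that added above the TEMPLATE, you register and run the configured model with (the name is a placeholder):

ollama create mistral-notes -f Modelfile
ollama run mistral-notes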
Mistral 7b Instruct v0.2 Result
I won’t say that Mistral does it perfectly every single time, but more often than not, this is my result. And if you look back at the GPT3.5 response, you might agree this is better.
OpenChat 3.5 0106 Q8 GGUF
I was pleasantly surprised by OpenChat’s 0106. Here is a model that claims to be the best 7b, and it is at least competitive with Mistral 7b.
Modelfile
TEMPLATE """
GPT4 Correct User: {{ .Prompt }}<|end_of_turn|>GPT4 Correct Assistant:
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
OpenChat 3.5 0106 Result
In this small sample, it produced bold headings 4/6 times. Later, I will review it, along with any other top contenders, using a more detailed analysis.
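Tallies like that 4/6 are easy to automate; here is a rough sketch that checks each summary for a bolded heading, under the assumption that the models emit Markdown-style **double-asterisk** bold.

import re

def has_bold_heading(summary):
    # True if any line consists entirely of a **bolded** heading.
    return any(re.match(r"^\s*\*\*.+\*\*\s*$", line) for line in summary.splitlines())

# With `notes` from the earlier pipeline sketch:
# hits = sum(has_bold_heading(n) for n in notes)
# print(f"{hits}/{len(notes)} summaries have bold headings")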
Snorkel Mistral Pairrm DPO Q8 GGUF
Obviously, I’m biased here, as Snorkel was trained on Mistral 7b Instruct 0.2. Regardless, I am cautiously optimistic and look forward to more releases from Snorkel.ai.
Modelfile
TEMPLATE """
<s></s>[INST] {{ .Prompt }} [/INST]
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
Snorkel Mistral Pairrm DPO Result
4/6 of these summaries are spot on, but the others contain irregularities such as overly long lists of key terms and headings, rather than bolding them inline as part of the summary.
Dolphin 2.6 Mistral 7B Q8 GGUF
Here is another Mistral derivative that’s well regarded.
Modelfile
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
Dolphin 2.6 Mistral 7B Result
This is another decent model that’s almost as good as Mistral 7b Instruct 0.2. Three out of six summaries had proper format and bold headings, another had good format with no bold, but two of the six were bad form all around.
OpenHermes 2.5 Mistral-7B Q8 GGUF
This model is quite popular, both on leaderboards and among “the people” in unaffiliated Discord chats. I want it to be a leader in this ranking, but it’s just not.
Modelfile
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
OpenHermes 2.5 Mistral Result
3/6 results produced proper structure but no bold text. One had both structure and bold text. The other two had big blocks of text and poor structure.
OpenHermes 2.5 Neural Chat 7b v3.1 7B Q8 GGUF
I also tried a few high-ranking derivatives of OpenHermes 2.5 Mistral to see if I could get better results. Unfortunately, that was not the case.
Modelfile
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
OpenHermes 2.5 Neural Chat 7b v3.1 Result
None of these results were desirable.
OpenHermes 2.5 Neural-Chat v3.3 Slerp Q8 GGUF
Whatever they did, these derivatives did not improve upon the original.
Modelfile
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
OpenHermes 2.5 neural-chat v3.3 Slerp Result
It’s just getting worse with each new version!
Super Mario V2 Q8
I wasn’t expecting much from Mario, but it shows some promise. Meanwhile, v3 and v4 are available, but I haven’t found GGUFs for those yet.
Modelfile
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
Super Mario V2 Result
Its first result was deceptively good, but each of the following summaries deviated from the desired pattern. I’ll be on the lookout for GGUFs of the newer releases. As you can see here, we got blocks of paragraphs with an initial bolded heading, which is not really what I asked for.
Honorable Mentions
- omnibeagle-7b (ChatML) - This one actually produces a decent format, but no bolded text.
- neuralbeagle14-7b (ChatML, Mistral) - Works better with the Mistral template. “OK” results, but too much confusion around prompt templates for my liking.
- WestLake-7B-v2 (ChatML) - I’ve seen worse.
- MBX-7B-v3-DPO (ChatML) - No consistency in format.
Conclusion
I wish I had better news to share. My ideal headline would be that there is an abundance of leading models producing quality output for comprehensive bulleted note summaries, and that it’s hard to choose among them. Unfortunately, that is not the case.
Maybe these models outperform Mistral 0.2 in full, unquantized form and only trail it in GGUF format? I think it’s quite likely that none of our existing evals target this type of output, but I would certainly argue it’s a task that any leading 7b GGUF model should be able to manage.
Another thing to consider is that Mistral 7b Instruct v0.2 came out soon after Mixtral, amidst all of Mixtral’s fanfare, and I think the v0.2 release slipped under the radar. In fact, many of the “leading” models I’ve looked at are based on Mistral 0.1.
Maybe things will change, and the world will realize that its best models still can’t top Mistral? Then again, maybe all those models really are good at the other tasks I’m not asking them for.
I’m Willing to Help, and I’m Willing to be Proven Wrong
I have data, I have a pipeline, and I have an endless need to create bulleted note summaries. If you want to work with me, please reach out.
You are also welcome to check out my GitHub, examine the data, and try your own version of this experiment. I’m happy to be proven wrong.