I started to summarize a dozen books by hand and found it was going to take me weeks per summary. Then I remembered the AI revolution happening around me and decided I was long past due to jump into these waters.
When I began exploring the use of large language models (LLMs) for summarizing large texts, I found no clear direction on how to do so.
- Some pages give example prompts to feed GPT-4, as if it will magically know the contents of whatever book you want summarized. (NOT)
- Some people suggested I needed to find a model with a large context window that could process my whole text in one go. (Not Yet)
- Some open source tools are available that allow you to upload documents to a database and answer questions based on the contents of that database. (Getting Closer)
- Others have suggested that you must first divide the book into sections and feed them into the LLM for summarization one at a time. (Now we’re talking)
- Beyond making that determination, there are numerous variables that must be accounted for when running a given LLM.
- I quickly realized that, despite any recommendations or model rankings available, I was getting different results than others had reported. Whether it's my use-case, the model format, quantization, compression, prompt styles, or something else, I don't know. All I know is: do your own model rankings under your own working conditions. Don't just believe some chart you read online.
This guide provides some specifics on my process for determining and testing the variables mentioned above.
Find the complete ranking data, walkthrough, and resulting summaries on GitHub.
Background
Key Terms
Some of these terms are used in different ways, depending on the context (no pun intended).
- Large Language Model (LLM): (AKA Model) A type of artificial intelligence trained on massive datasets to understand and generate human language.
  Example: OpenAI's GPT-3.5 and GPT-4, which have taken the world by storm. (In our case, we are choosing among open-source and/or freely downloadable models found on Hugging Face.)
- Retrieval Augmented Generation (RAG): A technique, developed by Meta AI, of storing documents in a database that the LLM searches to find an answer to a given user query (document Q/A).
- User Instructions: (AKA Prompt, or Context) The query provided by the user.
  Example: "Summarize the following text: { text }"
- System Prompt: Special instructions given before the user prompt that help shape the personality of your assistant.
  Example: "You are a helpful AI Assistant."
- Context: The user instructions, possibly a system prompt, and possibly previous rounds of question/answer pairs. (Previous Q/A pairs are also referred to simply as context.)
- Prompt Style: The special character combinations an LLM is trained with to distinguish user instructions, the system prompt, and context from previous questions.
  Example:
  <s>[INST] {systemPrompt} [/INST] [INST] {previousQuestion} [/INST] {answer} </s> [INST] {userInstructions} [/INST]
- 7B: Indicates the number of parameters in a given model (higher is generally better). Parameters are the internal variables the model learns during training and uses to make predictions. For my purposes, 7B models are likely to fit on my GPU with 12GB VRAM.
- GGUF: A specific format for LLMs designed for consumer hardware (CPU/GPU). Whatever model you are interested in, for use in PrivateGPT you must find its GGUF version (commonly made by TheBloke).
- Q2-Q8_0, K_M or K_S: When browsing the files of a GGUF repository, you will see different quantizations of the same model. A higher number means less compression and better quality. The M in K_M means "Medium" and the S in K_S means "Small".
- VRAM: The memory capacity of your GPU. To load a model completely onto the GPU, you will want a model smaller than your available VRAM.
- Tokens: The unit LLMs use to measure text. Each token consists of roughly 4 characters.
What is PrivateGPT?
PrivateGPT (pgpt) is an open-source project that provides a user interface and programmable API, enabling users to run LLMs on their own hardware, at home. It allows you to upload documents to your own local database for RAG-supported document Q/A.
PrivateGPT Documentation - Overview:
PrivateGPT provides an API containing all the building blocks required to build private, context-aware AI applications. The API follows and extends OpenAI API standard, and supports both normal and streaming responses. That means that, if you can use OpenAI API in one of your tools, you can use your own PrivateGPT API instead, with no code changes, and for free if you are running privateGPT in local mode.
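Because the API is OpenAI-compatible, a quick sanity check can reuse the standard OpenAI client. The sketch below is my illustration, not official PrivateGPT documentation; the base URL, port, and model name are assumptions you should swap for whatever your local server reports on startup.

```python
# A minimal sketch: pointing the standard OpenAI client at a local
# PrivateGPT instance. Base URL/port and the model name are assumptions;
# adjust them to match your own setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local",  # placeholder; a local OpenAI-compatible server typically ignores this
    messages=[{"role": "user", "content": "Summarize the following text: ..."}],
)
print(response.choices[0].message.content)
```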
Overview
- I began by just asking questions of book chapters, using the PrivateGPT UI/RAG. Then I tried pre-selecting text for summarization. This was the inspiration for the Round 1 rankings: to see how big a difference summarizing pre-selected sections would make.
- Next I wanted to find which models would do best with this task, which led to the Round 2 rankings, where Mistral-7B-Instruct-v0.2 was the clear winner.
- Then I wanted to get the best results from this model by ranking prompt styles and writing code to produce the exact prompt style it expects.
- After that, of course, I had to test out various system prompts to see which would perform best.
- Next, I tried a few user prompts to determine which one generates summaries requiring the least post-processing from me.
- Ultimately, this type of testing should be conducted for each LLM, and to determine the effectiveness of any refinement in the process. In my opinion, only once each model has been tuned to its most ideal conditions can they be properly ranked against each other.
Rankings
When I began testing various LLM variants, mistral-7b-instruct-v0.1.Q4_K_M.gguf came as part of PrivateGPT's default setup (made to run on your CPU). Here, I've preferred the Q8_0 variants.
While I've tried 50+ different LLMs for this same task, Mistral-7B-Instruct is still among the best; since v0.2 was released, I haven't found anything better.
TLDR: Mistral-7B-Instruct-v0.2 is my current leader for summarization tasks.
Round 1 - Q/A vs Summary
I quickly discovered, when doing Q/A, that I get much better results by uploading smaller chunks of data to the database and starting with a clean slate each time. So I began splitting PDFs into chapters for Q/A purposes, as sketched below.
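For illustration, a chapter-extraction pass might look like the following sketch. The use of pypdf and the page ranges are my assumptions; substitute the chapter boundaries from your own book's table of contents.

```python
# A minimal sketch, assuming pypdf; the page ranges below are hypothetical
# placeholders for the chapter boundaries in your own book.
from pypdf import PdfReader

reader = PdfReader("book.pdf")
chapters = {"chapter_01": (10, 79), "chapter_02": (80, 142)}  # inclusive page ranges

for name, (start, end) in chapters.items():
    text = "\n".join(reader.pages[i].extract_text() for i in range(start, end + 1))
    with open(f"{name}.txt", "w", encoding="utf-8") as f:
        f.write(text)
```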
For my first analysis, I tested 5 different LLMs on the following tasks:
- Asking the same 30 questions to a 70 page book chapter.
- Summarizing that same 70 page book chapter, divided into 30 chunks.
Question / Answer Ranking
- Hermes Trismegistus Mistral 7b - My favorite, during these tests, but when actually editing the summaries I decided it was too verbose.
- SynthIA 7B V2 - Became my favorite of models tested in this round.
- Mistral 7b Instruct v0.1 - Not as good as I’d like.
- CollectiveCognition v1.1 Mistral 7b - A lot of filler, and it took the longest time of them all. It scored a bit higher than Mistral on quality/usefulness, but the amount of filler made it less enjoyable to read.
- KAI 7b Instruct - The answers were too short, which made its BS stand out a little more. A good model, but not for detailed book summaries.
Shown for each model:
- Number of seconds required to generate the answer
- Sum of subjective usefulness/quality ratings
- Number of characters generated
- Sum of context chunks found in the target range
- Number of the qualities listed below found in the generated text:
- Filler (Extra words with less value)
- Short (Too short, not enough to work with.)
- BS (Not from this book and not helpful.)
- Good BS (Not from the targeted section but valid.)
Model | Rating | Search Accuracy | Characters | Seconds | BS | Filler | Short | Good BS
---|---|---|---|---|---|---|---|---
hermes-trismegistus-mistral-7b | 68 | 56 | 62141 | 298 | 3 | 4 | 0 | 6
synthia-7b-v2.0 | 63 | 59 | 28087 | 188 | 1 | 7 | 7 | 0
mistral-7b-instruct-v0.1 | 51 | 56 | 21131 | 144 | 3 | 0 | 17 | 1
collectivecognition-v1.1-mistral-7b | 56 | 57 | 59453 | 377 | 3 | 10 | 0 | 0
kai-7b-instruct | 44 | 56 | 21480 | 117 | 5 | 0 | 18 | 0
Summary Ranking
For this first round, I split the chapter contents into sections of 900-14000 characters each (or 225-3500 tokens).
NOTE: Despite the numerous large-context models being released, for now I still believe smaller context results in better summaries. I don't go beyond 2750 tokens (11000 characters) per summarization task.
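A chunking pass in that spirit might look like the sketch below. The 4-characters-per-token estimate comes from the key terms above; splitting on paragraph boundaries is my own simplification, not the only way to do it.

```python
# A minimal sketch: split chapter text into chunks of at most ~11000
# characters (~2750 tokens at ~4 characters per token), breaking on
# paragraph boundaries so each section stays coherent.
MAX_CHARS = 11000

def split_into_chunks(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        # Start a new chunk if adding this paragraph would exceed the limit.
        if current and len(current) + len(paragraph) + 2 > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

chapter = open("chapter_01.txt", encoding="utf-8").read()
for i, chunk in enumerate(split_into_chunks(chapter), 1):
    print(f"chunk {i}: ~{len(chunk)} chars, ~{len(chunk) // 4} tokens")
```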
- Hermes Trismegistus Mistral 7b - Still in the lead. It's verbose, with some filler. I can use these results.
- SynthIA 7B - Pretty good, but too concise. Many of the answers were perfect, but 7 were too short/incomplete for use.
- Mistral 7b Instruct v0.1 - Just too short.
- KAI 7b Instruct - Just too short.
- CollectiveCognition v1.1 Mistral 7b - Lots of garbage. Some of the summaries were super detailed and perfect, but over half of the responses were a set of questions based on the text, not a summary.
Not surprisingly, summaries performed much better than Q/A, but they also had a more finely targeted context.
Name | Score | Characters Generated | % Diff from OG | Seconds to Generate | Short | Garbage | BS | Fill | Questions | Detailed
---|---|---|---|---|---|---|---|---|---|---
hermes-trismegistus-mistral-7b | 74 | 45870 | -61 | 274 | 0 | 1 | 1 | 3 | 0 | 0
synthia-7b-v2.0 | 60 | 26849 | -77 | 171 | 7 | 1 | 0 | 0 | 0 | 1
mistral-7b-instruct-v0.1 | 58 | 25797 | -78 | 174 | 7 | 2 | 0 | 0 | 0 | 0
kai-7b-instruct | 59 | 25057 | -79 | 168 | 5 | 1 | 0 | 0 | 0 | 0
collectivecognition-v1.1-mistral-7b | 31 | 29509 | -75 | 214 | 0 | 1 | 1 | 2 | 17 | 8
Find the full data and rankings on Google Docs or on GitHub: QA Scores, Summary Rankings.
Round 2: Summarization - Model Ranking
Again, I prefer Q8 versions of 7B models.
The release of Mistral 7b Instruct v0.2 was well worth a new round of testing.
I also decided to test prompt styles. PrivateGPT didn't come packaged with the Mistral prompt style, and while the Mistral prompt is similar to the Llama2 prompt, the model seemed to perform better with the default (llama-index) prompt.
- SynthIA-7B-v2.0-GGUF - This model had become my favorite, so I used it as a benchmark.
- Mistral-7B-Instruct-v0.2 (Llama-index Prompt) - Star of the show here, quite impressive.
- Mistral-7B-Instruct-v0.2 (Llama2 Prompt) - Still good, but not as good as with the llama-index prompt.
- Tess-7B-v1.4 - Another by the same creator as Synthia v2. Good, but not as good.
- Llama-2-7B-32K-Instruct-GGUF - Worked OK, but slowly, with the llama-index prompt. Just bad with the llama2 prompt. (Should test again with the Llama2 "Instruct Only" style.)
Summary Ranking
Only summaries this time; Q/A is just less efficient for book summarization.
Model | % Difference | Score | Comment
---|---|---|---
Synthia 7b V2 | -64.43790093 | 28 | Good
Mistral 7b Instruct v0.2 (Default Prompt) | -60.81878508 | 33 | VGood
Mistral 7b Instruct v0.2 (Llama2 Prompt) | -64.5871483 | 28 | Good
Tess 7b v1.4 | -62.12938978 | 29 | Less Structured
Llama 2 7b 32k Instruct (Default) | -61.39890553 | 27 | Less Structured. Slow
Find the full data and rankings on Google Docs or on GitHub.
Round 3: Prompt Style
In the previous round, I noticed Mistral 7b Instruct v0.2 was performing much better with the default prompt style than with llama2.
Well, actually, the mistral prompt is quite similar to llama2, but not exactly the same.
- llama_index (default)
system: {{systemPrompt}}
user: {{userInstructions}}
assistant: {{assistantResponse}}
- llama2:
<s> [INST] <<SYS>>
{{systemPrompt}}
<</SYS>>
{{userInstructions}} [/INST]
- mistral:
<s>[INST] {{systemPrompt}} [/INST]</s>[INST] {{userInstructions}} [/INST]
I began testing output with the default, then llama2, prompt styles. Next I went to work coding the mistral template.
The results of that ranking gave me confidence that I coded correctly.
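For reference, the coded template amounted to something like the sketch below. The function name and the optional system-prompt handling are my own illustration, not PrivateGPT internals; it simply assembles the mistral format shown above.

```python
# A minimal sketch of assembling the Mistral prompt style shown above.
# The function name and optional system-prompt handling are illustrative;
# adapt to however your framework injects prompt templates.
def mistral_prompt(user_instructions: str, system_prompt: str = "") -> str:
    prompt = "<s>"
    if system_prompt:
        prompt += f"[INST] {system_prompt} [/INST]</s>"
    prompt += f"[INST] {user_instructions} [/INST]"
    return prompt

print(mistral_prompt("Summarize the following text: ..."))
```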
Prompt Style | % Difference | Score | Note
---|---|---|---
Mistral | -50% | 51 | Perfect!
Default (llama-index) | -42% | 43 | Bad headings
Llama2 | -47% | 48 | No Structure
Find the full data and rankings on Google Docs or on GitHub.
Round 4: System Prompts
Once I got the prompt style dialed in, I tried a few different system prompts, and was surprised by the result!
Name | System Prompt | Change | Score | Comment
---|---|---|---|---
None | | -49.8 | 51 | Perfect
Default Prompt | "You are a helpful, respectful and honest assistant. \nAlways answer as helpfully as possible and follow ALL given instructions. \nDo not speculate or make up information. \nDo not reference any given instructions or context." | -58.5 | 39 | Less Nice
MyPrompt1 | "You are Loved. Act as an expert on summarization, outlining and structuring. \nYour style of writing should be informative and logical." | -54.4 | 44 | Less Nice
Simple | "You are a helpful AI assistant. Don't include any user instructions, or system context, as part of your output." | -52.5 | 42 | Less Nice
In the end, I find that Mistral 7b Instruct v0.2 works best for my summaries without any system prompt.
Maybe I would get different results for a different task, or with better prompting, but this works well, so I'm not messing with it.
Find the full data and rankings on Google Docs or on GitHub.
Round 5: User Prompt
I had already begun to suspect that I get better results with fewer words in the prompt. Having found the best system prompt for Mistral 7b Instruct v0.2, I also tested which user prompt suits it best.
Name | Prompt | vs OG | Score | Note
---|---|---|---|---
Prompt0 | Write concise, yet comprehensive, notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. Focus on essential knowledge from this text without adding any external information. | 43% | 11 |
Prompt1 | Write concise, yet comprehensive, notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. Focus on essential knowledge from this text without adding any external information. | 46% | 11 | Extra Notes
Prompt2 | Write comprehensive notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. | 58% | 15 |
Prompt3 | Create concise bullet-point notes summarizing the important parts of the following text. Use nested bullet points, with headings terms and key concepts in bold, including whitespace to ensure readability. Avoid Repetition. | 43% | 10 |
Prompt4 | Write concise notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. | 41% | 14 |
Prompt5 | Create comprehensive, but concise, notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. | 52% | 14 | Extra Notes
Find the full data and rankings on Google Docs or on GitHub.
Perhaps with more powerful hardware that can support 11b or 30b models I would get better results with more descriptive prompting. Even with Mistral 7b Instruct v0.2 I’m still open to trying some creative instructions, but for now I’m just happy to refine my existing process.
Prompt2: Wins!
Write comprehensive notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold.
In this case, "comprehensive" performs better than "concise", or even "comprehensive, but concise".
However, I do caution that this will depend on your use-case. What I'm looking for is highly condensed, readable notes covering the important knowledge.
Essentially, if I didn't read the original, I should still know what information it conveys, if not every specific detail. Even if I did read the original, I'm not going to remember the majority of it later on. These notes are a quick reference to the main topics.
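Putting the winning pieces together (Mistral prompt style, no system prompt, Prompt2 as the user instructions), a single summarization prompt might be assembled as in the sketch below. Appending the chunk text after the instructions mirrors the "Summarize the following text: { text }" pattern from the key terms; the helper name is my own.

```python
# A minimal sketch of the final recipe: Mistral prompt style, no system
# prompt, and Prompt2 as the user instructions. Helper name is illustrative.
PROMPT2 = (
    "Write comprehensive notes summarizing the following text. "
    "Use nested bullet points: with headings, terms, and key concepts in bold."
)

def build_summary_prompt(chunk: str) -> str:
    user_instructions = f"{PROMPT2}\n\n{chunk}"
    # No system prompt: just the user turn wrapped in the Mistral style.
    return f"<s>[INST] {user_instructions} [/INST]"

print(build_summary_prompt("...chapter chunk text..."))
```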
Result
Using knowledge gained from these tests, I summarized my first complete book, 539 pages in 5-6 hours!!! Incredible!
Instead of spending weeks per summary, I completed my first 9 book summaries in only 10 days.
Plagiarism
Below, you can see the results from CopyLeaks for each of the texts published here.
Especially considering that this is not for profit, but for educational purposes, I believe these numbers are acceptable.
Book | Models | Character Difference | Identical | Minor changes | Paraphrased | Total Matched
---|---|---|---|---|---|---
Eastern Body Western Mind | Synthia 7Bv2 | -75% | 3.5% | 1.1% | 0.8% | 5.4%
Healing Power Vagus Nerve | Mistral-7B-Instruct-v0.2; SynthIA-7B-v2.0 | -81% | 1.2% | 0.8% | 2.5% | 4.5%
Ayurveda and the Mind | Mistral-7B-Instruct-v0.2; SynthIA-7B-v2.0 | -77% | 0.5% | 0.3% | 1.2% | 2%
Healing the Fragmented Selves of Trauma Survivors | Mistral-7B-Instruct-v0.2 | -75% | | | | 2%
A Secure Base | Mistral-7B-Instruct-v0.2 | -84% | 0.3% | 0.1% | 0.3% | 0.7%
The Body Keeps the Score | Mistral-7B-Instruct-v0.2 | -74% | 0.1% | 0.2% | 0.3% | 0.5%
Complete Book of Chakras | Mistral-7B-Instruct-v0.2 | -70% | 0.3% | 0.3% | 0.4% | 1.1%
50 Years of Attachment Theory | Mistral-7B-Instruct-v0.2 | -70% | 1.1% | 0.4% | 2.1% | 3.7%
Attachment Disturbances in Adults | Mistral-7B-Instruct-v0.2 | -62% | 1.1% | 1.2% | 0.7% | 3.1%
Psychology Major's Companion | Mistral-7B-Instruct-v0.2 | -62% | 1.3% | 1.2% | 0.4% | 2.9%
Psychology in Your Life | Mistral-7B-Instruct-v0.2 | -74% | 0.6% | 0.4% | 0.5% | 1.6%
Completed Book Summaries
In parentheses is the page count of the original.
- Eastern Body Western Mind Anodea Judith (436 pages)
- Healing Power of the Vagus Nerve Stanley Rosenberg (335 Pages)
- Ayurveda and the Mind Dr. David Frawley (181 Pages)
- Healing the Fragmented Selves of Trauma Survivors Janina Fisher (367 Pages)
- A Secure Base John Bowlby (133 Pages)
- The Body Keeps the Score Bessel van der Kolk (454 Pages)
- Yoga and Polyvagal Theory, from Polyvagal Safety Steven Porges (37 pages)
- Llewellyn's Complete Book of Chakras Cyndi Dale (999 pages)
- Fifty Years of Attachment Theory: The Donald Winnicott Memorial Lecture (54 pages)
- Attachment Disturbances in Adults (477 Pages)
- The Psychology Major's Companion Dana S. Dunn, Jane S. Halonen (308 Pages)
- The Myth of Redemptive Violence Walter Wink (5 Pages)
- Psychology In Your Life Sarah Grison and Michael S. Gazzaniga (1072 Pages)
Walkthrough
If you are interested in following my steps more closely, check out the walkthrough on GitHub, containing scripts and examples.
Conclusion
Now that I have my processes refined and feel confident working with prompt formats, I will conduct further tests. In fact, I have already conducted further tests and rankings (which I will publish next), and of course I will keep testing and learning!
I still believe if you want to get the best results for whatever task you perform with AI, you ought to run your own experiments and see what works best. Don’t rely solely on popular model rankings, but use them to guide your own research.
Additional Resources
- Pressure-tested the most popular open-source LLMs (Large Language Models) for their Long Context Recall abilities u/ramprasad27 (Part 2)
- LeonEricsson / llmcontext - 💢 Pressure testing the context window of open LLMs
- Chatbot Arena Leaderboard
- 🐺🐦⬛ LLM Comparison/Test: Ranking updated with 10 new models (the best 7Bs)! u/WolframRavenwolf
- Hallucination leaderboard Vectara