BBC Research & Development

Posted by Henry Cooke

I'm excited! After weeks of trying to get an LLM to output reliable, useful summaries of research interview transcripts, I've hit on something that works. This is a post about the journey to get here, and some details on the process I've arrived at. If you don't want any of the preamble or context, you can skip straight to "The Pipeline" below for specifics on the method I've patched together.

One of the main ways we do foresight in BBC Research & Development's Advisory team is by interviewing experts and doing thematic analysis on those interviews. It's a good way of distilling insight and consensus, but it takes forever. You need to do a close read of every interview, picking out key quotes and identifying themes. You'll typically do many passes of this over the course of a project as your themes and understanding of the material develop. One of my colleagues, while we were deep in this analysis on a project last year, asked me: "couldn't we just get an AI to do this?" This was an interesting question. Whatever you think about the current wave of AI hype, it's undeniable that one thing LLMs are good at is finding patterns in swathes of language data. It's their fundamental function. So on the surface of it, analysing a bunch of interview transcripts (not even generating anything new!) seems like an ideal task for an LLM.

At the time, I wasn't keen on adding a learning curve to the big pile of work we already had to do. However, over the last couple of months I've had a bit of time to look into the question and do some tinkering with a locally-hosted model.

Saying this loud for the folks at the back

I need to make it absolutely clear here that I am not suggesting that human researchers can, or should, be replaced by LLMs and Python scripts. There's no way we should be directly relying on the outputs of LLMs in a research context. As my esteemed colleague Bill Thompson likes to remind us: LLMs hallucinate all the time; it's just that sometimes the hallucinations are useful. They make stuff up, and they miss subtle, but key, details.

Nor am I suggesting that we should completely hand off to an LLM the part of thematic analysis where we go over the source material. It's an important step which cements the material in our minds and gets us thinking about the emerging themes and how they all fit together.

What I am hoping to do is generate sufficiently useful interview summaries to jump-start the analysis process with a good-enough first pass. Many people, me included, find it much easier to approach a writing task if we have something, anything, to edit.

While I'm disclaimering: I am in no way claiming to be an expert on AI or LLMs. I'm an experienced programmer, and I've got many years' experience with various bits and pieces of language processing, but I'll level with you: whenever I try to read any documentation or papers about machine learning stuff, it's more or less incomprehensible to me. I'm indebted to the many people who have put time and effort into making this stuff usable by non-experts[1], and those who take the time to post tips and tutorials online.

Drawing a circle in chalk

We knew from the start that we couldn’t run this experiment in ChatGPT or Copilot. The BBC, like many other large organisations, is taking a cautious approach to generative AI and LLMs, and we have clear editorial guidelines and guidance on the use of AI that would rule this out.

So we made a few operational decisions to minimise risk and cleared the proposal with colleagues looking after responsible AI policies, data protection and infosec.

First, we anonymise the interview transcripts before sending them to the model for analysis, so it never has data about who was speaking. Second, we run the LLMs on our own hardware, on BBC premises[2]. If I wanted to, I could get on a train to London and prod the server that we’re using with a screwdriver.[8]

This means that we’re in total control of how, when and where the models run, and no sensitive data leaves the BBC estate for processing. Let us not forget that the cloud is, after all, just someone else’s computer.

Third, we don’t use the outputs from the LLM directly. They are used strictly as supporting material during a research process; they’re checked by human researchers against interview transcripts (and recordings, if necessary), and they won’t be part of the (human) written outputs of our research projects. With all that in mind, I spun up an LLM instance on an R&D server (via a tool which made this a pleasingly simple task), and set about writing some (Python) code to throw our interview transcripts at the LLM.

It's kind of fascinating working with LLMs at this level. On the one hand, it's absolutely wild to me that I can get a computer to do a decent job of summarising some text literally by telling it, in English: "your task is to summarise a transcript of a research interview."

On the other hand, the technology is so new, and fast-moving, and (most importantly) black-boxed, that we're very much in the "feeling our way around" stage on the path to usefulness - some pivotal techniques I'm about to outline were gathered from various discussion threads where people share phrases and formulae they've arrived at essentially by empirical reasoning. Not for nothing has it been alleged that "."[3]

The last 10% is 90% of the effort

It’s not been a straightforward ride: the strategy I'm describing in this post was pretty much a Hail Mary before giving up entirely. The problem is hallucination: sometimes, LLMs just make up a bunch of stuff instead of sticking to the script. This is by design. An LLM has no actual knowledge, or reasoning capabilities as such: it's generating plausible output, given some opening text. It has no concept of 'truth' or 'accuracy' - it's been said that LLMs are "".[4]

In practice, this meant that I fairly quickly got to a process which was reliably generating plausible-looking interview summaries, but those summaries would often include things the interviewees never said. Not ideal, from a research point of view. The hard bit turned out to be minimising those hallucinations.

So while it's still fresh in my mind, I'm going to outline the recipe I've arrived at to get consistently useful, accurate interview summaries. It's a pipeline that consists of an LLM, some prompt engineering, and a Python script. I have another, higher-level post brewing on the overall workflow for thematic analysis I've been working on with an LLM in the loop. This post is the detail about a process I've cooked up for minimising LLM hallucinations while summarising text - one component of that overall workflow.

The pipeline

Prompt stuffing

This literally means 'stuffing' (or inserting) additional context into the prompt you send to the LLM. Prompt stuffing is a basic building block of 'retrieval augmented generation', or RAG - a general approach used in prompting LLMs which revolves around giving the model extra context needed to answer specialist requests and steer it to using that context over making stuff up. In this case, the extra context is a chunk of interview transcript. I read about it in the , and the only trick here is stuffing the context into the prompt fairly near the beginning, as discussed in the next section.
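In code terms, the stuffing itself is trivial - something like the minimal sketch below, where the cut-down template and the build_prompt function name are just for illustration (the full prompt I actually use is shown later in this post):

SUMMARY_PROMPT = """You are an expert qualitative researcher skilled in thematic analysis. Your task is to summarise a transcript of a research interview.

### Interview Transcript ###:
{docs}

### Instructions ###:
1. Summarise the main topics covered in "Interview Transcript".
2. Include 1-2 illustrative quotes, transcribed exactly as they appear.
"""

def build_prompt(transcript_chunk: str) -> str:
    """Stuff a chunk of interview transcript into the summarisation prompt."""
    return SUMMARY_PROMPT.format(docs=transcript_chunk)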

Keep it simple

Even with models that apparently have a large "context window" (basically, maximum input length), I've noticed that I get essentially garbage back from the LLM when I feed it too much text. It feels like challenging its 'attention span' makes it 'lose interest' and fall back on generating an answer from what it already knows - the language already present in the model.

So my script breaks the transcripts down into shorter chunks[9] using a text splitter before combining them with the summarisation prompt and sending each one off to the LLM. I also keep the prompt as pithy and unambiguous as I can, and stuff the interview chunk into it early, giving more detailed instructions after the interview text. This seems to do a better job of keeping the LLM on-task.
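The splitter I actually use comes from a library, but the basic idea is simple enough to sketch in plain Python - accumulate paragraphs until a chunk hits a rough size budget, then start a new one. The character budget here is purely illustrative; footnote 9 describes how I felt my way towards a real chunk size.

def chunk_transcript(text: str, max_chars: int = 8000) -> list[str]:
    """Split a transcript into chunks of roughly max_chars characters,
    breaking on paragraph boundaries so speaker turns stay intact."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        # Start a new chunk if adding this paragraph would blow the budget.
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks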

In this respect, I've come to think of LLMs as being capable research assistants who are very easily distracted, and design my prompts accordingly.[5] Of course, it's equally likely that there's something wrong with the way I'm running the model, or the setup on the server, that's causing the model's attention to drift. To reiterate: I'm not an expert. One to investigate later.

Explicitly instruct the LLM to stick to the provided material

I got this idea from a Reddit thread, where a user suggests using this phrasing to help mitigate hallucinations when using RAG.

Use “what you know” and only “What you know” to respond to the user.
...
“What you know” = {text}

I used this to focus the LLM's attention on the interview transcript stuffed into the prompt.

Ask the LLM to check its own work

This is also from the above-mentioned Reddit thread, where user Jdonavan reports using this phrasing to keep GPT on-track:

Internally generate three possible answers, then evaluate each for accuracy against the context

I have no idea if it's actually generating possible answers and evaluating them internally, but adding an edict along these lines does a pretty good job of preventing the LLM from inventing quotes that weren't in the original text.

The prompt

Putting that all together, the prompt I'm currently using looks like this:

 <s> [INST] You are an expert qualitative researcher skilled in thematic analysis. Your task is to summarise a transcript of a research interview.

Use “Interview Transcript” and only “Interview Transcript” to summarise the interview. Do not embellish or add detail to “Interview Transcript”.

### Interview Transcript ###:
{docs}

### Instructions ###:
1. Aim for a summary of around 400-500 words.
2. Your summary must include the following:
2a. The main topics and subjects covered.
2b. 5-7 most important findings and insights present in “Interview Transcript”.
2b1. Check these findings against “Interview Transcript” and reject any that do not have any supporting quotes in the text.
2c. 1-2 illustrative quotes for each key finding. 
2c1. Check these quotes against “Interview Transcript” and reject any that do not appear in the text.
2c2. Quotes must be transcribed exactly as they appear in “Interview Transcript”. Do not edit or summarise quotes.
2d. Any other relevant context or inferences you can make based on the interview content
2e. 1-2 surprising or unusual observations.
3. Format your summary in clear paragraphs with headings for each section. Use Markdown format.
4. Interview transcripts have been anonymised. Do not name the interviewee.

[/INST]
Summary: </s>

There are a few other ideas in here that I haven't mentioned elsewhere in the post: the use of ### to break the prompt up into sections is detailed in , and the <s> and [INST] tags are nicely explained in .

Checking the LLM's work

The final step that weeds out hallucinations is much dumber. My Python script finds quotes in a generated summary and compares them against the original text. If there are any words present in quotes from the summary that aren't in the original text (with a little margin for rephrasing or mistranscription)[6], we take that as an indicator that the LLM has been making stuff up and generate the summary again.
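The heart of that check is a few lines of Python along these lines - a simplified sketch, with an illustrative tolerance value rather than the exact margin my script uses:

import re

def quotes_look_genuine(summary: str, transcript: str, tolerance: float = 0.1) -> bool:
    """Check that quoted material in the summary actually appears in the transcript,
    allowing a small proportion of unmatched words as a margin for rephrasing
    or mistranscription."""
    transcript_words = set(re.findall(r"[\w']+", transcript.lower()))
    for quote in re.findall(r'[“"]([^”"]+)[”"]', summary):
        quote_words = re.findall(r"[\w']+", quote.lower())
        if not quote_words:
            continue
        missing = [w for w in quote_words if w not in transcript_words]
        if len(missing) / len(quote_words) > tolerance:
            return False  # too much of this quote is nowhere in the source text
    return True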

Because LLM output is non-deterministic, chances are that a freshly generated summary will be closer to the mark - the script also generates a new random seed and gradually increases the temperature for each retry as a way to get more diverse output.

'Dumber' means reliable here though - by relying on a process which has very clear and repeatable criteria for a binary success or failure, we avoid the ambiguities inherent in a stochastic system.
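Putting the check and the retry together, the loop amounts to something like this sketch. Here generate_summary is a placeholder for however you call your locally hosted model, the starting temperature, increment and attempt limit are illustrative values rather than my exact ones, and build_prompt and quotes_look_genuine are the helpers sketched earlier.

import random

def summarise_with_checks(chunk: str, transcript: str, max_attempts: int = 5):
    """Generate a summary of one transcript chunk, re-rolling with a fresh seed
    and a slightly higher temperature whenever the quote check fails."""
    temperature = 0.2
    for _ in range(max_attempts):
        seed = random.randint(0, 2**31 - 1)
        # generate_summary() stands in for the call to the locally hosted model.
        summary = generate_summary(build_prompt(chunk), seed=seed, temperature=temperature)
        if quotes_look_genuine(summary, transcript):
            return summary
        temperature += 0.1  # nudge towards more diverse output on the next try
    return None  # give up and flag the chunk for a human to look at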

Putting it together

The Python script I've written is a bit of a mess, and not very complicated - essentially it's just sticking together the steps I outline above. The tricky bit is the thinking and the prompt engineering. So I'm not planning on publishing that any time soon - although I might tidy it up a bit and release something, if there's any demand.

I'll be writing up the overall workflow soon.

Things that didn't work (yet)

Few-shot prompting

This is a technique where you provide a model answer as part of your prompt, to guide the LLM's response. I've seen it be really effective as a way of getting consistently formatted output from an LLM and ensuring it hits all the points you want.
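For reference, the shape of a few-shot prompt is roughly the sketch below. The model answer baked into the template is entirely invented for illustration - in practice you'd use a real summary you were happy with.

FEW_SHOT_PROMPT = """Your task is to summarise a transcript of a research interview.

Here is an example of the kind of summary I want:

### Example Summary ###:
Main topics: remote collaboration, tooling fatigue.
Key findings: participants valued asynchronous updates over meetings.
Quotes: "We spend more time talking about the work than doing it."

Now summarise the following transcript in the same style.

### Interview Transcript ###:
{docs}
"""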

Unfortunately, I've not got this to work reliably yet. Every time I've tried, I've ended up with the same kind of garbage I described above in "Keep it simple". My assumption here is that the model answer is pushing the length of the prompt into "drifting attention" territory. This is a shame, and I'd like to get it to work - we'll be returning to it when we do more investigation into the attention span issue.

MapReduce

I read about this strategy in the documentation for the tooling I've been using. It's conceptually similar to the MapReduce model used for processing big data stores: a document that's longer than an LLM's context length is broken into chunks that do fit into the context. A summary is generated for each chunk (the 'map' step), then an overall summary is generated by combining all the chunk summaries together (the 'reduce' step).
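In code terms, the shape of it is roughly this sketch, reusing the chunk_transcript and summarise_with_checks helpers from the earlier sketches; the combining prompt and the call to the placeholder generate_summary are illustrative.

def map_reduce_summarise(transcript: str) -> str:
    """'Map': summarise each chunk of the transcript separately.
    'Reduce': ask the model to combine the chunk summaries into one overall summary."""
    chunk_summaries = [
        summarise_with_checks(chunk, transcript)
        for chunk in chunk_transcript(transcript)
    ]
    combined = "\n\n".join(s for s in chunk_summaries if s)
    reduce_prompt = (
        "Combine the following partial summaries of a single interview "
        "into one coherent summary:\n\n" + combined
    )
    # generate_summary() is the same placeholder call to the local model as before.
    return generate_summary(reduce_prompt, seed=0, temperature=0.2)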

Whenever I tried to do the 'reduce' step with an LLM, I ended up getting the kind of garbage back that I've come to associate with it 'losing interest'. My assumption here is that the combined summaries are too long for the LLM to be able to summarise effectively, so for now the 'reduce' is being done by a person.

This isn't so bad from a humans-doing-thematic-analysis point of view — we need to review the summaries anyway, so doing a bit of mental work to understand and combine the chunk summaries is doing an important job of familiarising ourselves with the material and sinking it into our brains. We'll be going back to this though, it should in theory be doable.

Final thoughts

It's cost me a few late nights (and more than one migraine [7]), but I now have a system which can reliably produce decent summaries from interview transcripts. As a bonus, by peering over my shoulder and asking me what I was doing while I worked some of this through, my 8-year-old son now has a more nuanced understanding of LLMs than your average AI hype merchant, so that's nice.

It should be obvious by now, but it bears repeating that this particular application of LLMs is never going to replace a human researcher. It’s bump-starting a specific part of a research process, which overall is heavily dependent on the instincts, knowledge, creative connection-making, and value judgements which remain stubbornly the forte of squishy human brains. I’m not sure LLMs ever could model these traits.

And I’d never trust the output of an LLM without checking it first. Hallucinations are an intrinsic part of the technology and there are always subtle little nuggets or allusions that an LLM will miss in an interview transcript.

However, we’ve found some useful ways to mitigate hallucinations when using LLMs to summarise text and help pick out quotes and key themes in material gathered as part of our research.


1. An interesting thing to me about the current wave of LLMs and generative AI is the amount of effort people are putting in to make this stuff accessible to others. This feels like a shift; I've sort of come to expect new technologies, especially open source ones, to come with an assumption of prior expertise. I like this shift!

2. I just did the maths, and apparently they each consume about as much electricity as a high-end gaming graphics card, sometimes a little less. So consider my late-night hacking sessions equivalent to me spending the same time playing Cyberpunk 2077 or something, in terms of energy use.

3. I've written about this sort of thing before. Suffice to say: not a fan.

4. I think that's a bit unfair, tbh, but it's a helpful generalisation to think with.

5. This is a remarkably similar set of strategies to some that I use with my small, chaotic children.

6. This could definitely be smarter, say by using a fuzzier text comparison, but this method works well enough for now, so why complicate things?

7. It has been suggested that we adopt the NHM metric (Number of Henry Migraines) as a measure of research tasks, like Sherlock Holmes and his famous three-pipe problems. So far, I have resisted this proposal.

8. This would probably make a few of my colleagues quite cross, but it’s theoretically doable.

9. The exact length of the chunks is something that varies from model to model and is something I felt my way towards by setting it to 2048 (original LLaMa’s context window size) and increasing it by powers of 2 until the model’s output started to lose coherence.