Finding Climate Targets with LLMs, Part 3 - RAG
Benchmarking RAG performance for identifying corporate climate targets.
· 6 min read
Introduction #
In my previous post, I determined that LLMs could do a decent job on their own of identifying the climate emission targets for a given top S&P 500 company and year. Precision reached up to 0.95, with recall of about 0.3 to 0.4.
Now I want to assess how this performance could improve with a conventional RAG pipeline operating on the documents I gathered. I am not aiming for optimal performance at this stage. I will use convenient default options to provide a baseline for future experiments to benchmark against.
SPOILERS: GPT-5.1 and GPT-5.2 outperform GPT-5 by similar margins. As expected, RAG improves overall performance. Taking GPT-5.2 as the RAG baseline, its precision drops from 0.95 to 0.76 while recall jumps from 0.36 to 0.69, which drives a large F1 gain from 0.52 to 0.72. This basic RAG setup improves overall performance, but it shows that we need more than an “out-of-the-box” approach to reach high performance.
If you are interested in the previous posts, please take a look below:
- https://www.reyfarhan.com/posts/climate-targets-01/
- https://www.reyfarhan.com/posts/climate-targets-02/
Methodology #
In part one, I gave a brief overview of RAG, and since I am far too efficient (lazy) to repeat myself, I encourage you to look there to refresh your understanding.
I chose LlamaIndex for the RAG implementation since I have a lot of experience with it. I produce the same structured-output target definitions used in part two, along with the same evaluation method, and the same harvested data serves as the corpus for retrieval to operate on.
With that in place, I put together a conventional “hybrid retriever” RAG approach, represented in Diagram 1 below.
flowchart TD
%% Retrieval-Augmented Generation
subgraph I[Indexing]
PDFS@{ shape: docs, label: "PDFs"}
TXTS@{ shape: docs, label: "Texts"}
NODES@{ shape: docs, label: "Nodes"}
ET[PDF to Text]
SS[Sentence-Splitter Chunking]
IN[Indexing]
PDFS --> ET
ET --> TXTS
TXTS --> SS
SS --> NODES
NODES --> IN
end
subgraph Q[Querying]
U@{ shape: manual-input, label: "User Query"}
R[Hybrid Retrieval]
LLM[Send to LLM]
U --> R
U --> LLM
end
S_IND[(Sparse Index)]
D_IND[(Dense Index)]
A[Get Structured Answer]
LLM --> A
IN --> S_IND
IN --> D_IND
S_IND --> R
D_IND --> R
R --> LLM
Diagram 1: Representation of the basic RAG System used here, utilising a hybrid retriever.
Here, the PDFs are pre-processed into text format and then split into smaller parts called “nodes,” which are segments of text that the system indexes and retrieves per query. Splitting the text into smaller parts like this means that in theory we can just access the most relevant parts of a document set and send those to the LLM to answer the user prompt.
That gives the user prompt two roles in this architecture: it provides the context for retrieval, and it provides the instructions for the task itself.
I used a simple “Sentence Splitter” technique to produce the nodes, which applies a heuristic to create nodes of a specified size while trying to preserve sentence and paragraph boundaries. Much more complex methods exist but are out of scope for now.
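For concreteness, here is a minimal sketch of that chunking step using LlamaIndex's SentenceSplitter (the directory name is a placeholder; the chunk settings are the ones listed in Table 1 below):

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Load the pre-converted text files (directory name is a placeholder)
documents = SimpleDirectoryReader("./corpus_txt").load_data()

# Split documents into nodes, trying to respect sentence boundaries
splitter = SentenceSplitter(chunk_size=8192, chunk_overlap=512)
nodes = splitter.get_nodes_from_documents(documents)
```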
PDFs were converted to text using PyMuPDF4LLM, a relatively basic approach. More sophisticated methods, such as OCR or using multimodal LLMs, exist and the domain is a very active area of development (e.g. see OCR Arena). However, again, they are out of scope for now.
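Roughly, the conversion step looks like the sketch below (directory names are placeholders, not the exact script I used):

```python
import pathlib

import pymupdf4llm

# Convert each report PDF to Markdown-flavoured text
for pdf_path in pathlib.Path("./corpus_pdf").glob("*.pdf"):
    md_text = pymupdf4llm.to_markdown(str(pdf_path))
    out_path = pathlib.Path("./corpus_txt") / (pdf_path.stem + ".txt")
    out_path.write_text(md_text, encoding="utf-8")
```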
I employed both “sparse” and “dense” indexes for retrieval. The sparse index uses the BM25 algorithm, searched via a BM25 retriever. Using a reciprocal rank fusion retriever, I combine this with the dense vector index, which holds embedding vectors for the node text, searched via a standard vector retriever. Put more simply, I combine keyword and vector-embedding search to get the best of both worlds when retrieving relevant nodes for a prompt.
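The fusion step scores each node by summing the reciprocal of its rank in each retriever's result list. A toy illustration of the idea (the constant k = 60 is the value commonly used for RRF; LlamaIndex's implementation differs in its details):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Toy reciprocal rank fusion: ranked_lists holds one best-first
    list of node ids per retriever (e.g. BM25 and vector)."""
    scores = {}
    for results in ranked_lists:
        for rank, node_id in enumerate(results, start=1):
            scores[node_id] = scores.get(node_id, 0.0) + 1.0 / (k + rank)
    # Higher fused score means the node ranked well across retrievers
    return sorted(scores, key=scores.get, reverse=True)

# "a" tops both lists, so it wins; "b" appears in both and beats "c"/"d"
print(reciprocal_rank_fusion([["a", "b", "c"], ["a", "d", "b"]]))  # ['a', 'b', 'd', 'c']
```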
LlamaIndex makes it relatively straightforward to configure a pipeline and put all of this together, along with structured outputs and suitable system and query prompts for retrieval.
Getting it all to work satisfactorily takes a bit of experimenting with the configuration parameters and eyeballing the output. I settled eventually on the values in Table 1 below.
| Variable | Value | What it controls |
|---|---|---|
| chunk_size | 8192 | Target chunk/node length for splitting source text before indexing into nodes. Larger nodes preserve more context but can dilute retrieval precision. |
| chunk_overlap | 512 | Characters repeated between adjacent nodes to reduce boundary loss. |
| embed_model | "text-embedding-3-large" | Embedding model used to vectorise nodes and queries for similarity search. |
| mode | "reciprocal_rerank" | Fusion/reranking strategy used when combining results from the BM25 and vector retrievers. |
| max_tokens | 16390 | Upper bound on tokens generated by the LLM for a response. |
| similarity_top_k | 10 | Number of top similar nodes retrieved by each of the BM25 and vector retrievers before any fusion/reranking. |
| fusion_top_k | 5 | Final number of nodes kept after fusion/reranking to pass downstream into the question-answering stage. |

Table 1: RAG pipeline parameters
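A sketch of how these parameters might be wired together in LlamaIndex, continuing from the `nodes` produced by the sentence splitter above (variable names, the query wording, and the OpenAI model identifier are illustrative placeholders; the structured outputs and custom system/query prompts are omitted):

```python
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.retrievers.bm25 import BM25Retriever

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")
Settings.llm = OpenAI(model="gpt-5.2", max_tokens=16390)  # model id is a placeholder

# Dense index and retriever over the nodes
vector_index = VectorStoreIndex(nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=10)

# Sparse BM25 retriever over the same nodes
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=10)

# Fuse both result lists with reciprocal rank fusion; keep the top 5 nodes
fusion_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    mode="reciprocal_rerank",
    similarity_top_k=5,   # fusion_top_k in Table 1
    num_queries=1,        # no LLM query rewriting, just the original prompt
)

query_engine = RetrieverQueryEngine.from_args(fusion_retriever)
response = query_engine.query(
    "What greenhouse gas emission reduction targets did <company> set in <year>?"
)
```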
The evaluation uses the same method as in part 2, with the same micro precision, recall, and F1 score calculations to measure performance.
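As a reminder, the micro scores pool the per-company-and-year counts before computing the ratios. A minimal sketch in my own notation (it assumes predicted targets have already been matched into true positive, false positive, and false negative counts):

```python
def micro_scores(counts):
    """counts: iterable of (tp, fp, fn) tuples, one per company-year."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # The "hallucination rate" reported below is simply 1 - precision
    return precision, recall, f1
```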
Every component in Diagram 1 has an effect on the overall performance of the pipeline, and practically every component can be improved. I chose convenient default options to act as a baseline for future experiments to benchmark against.
Results #
| Model | Micro precision (mean ± std) | Micro recall (mean ± std) | Micro F1 (mean ± std) | Hallucination rate (mean ± std) | n |
|---|---|---|---|---|---|
| GPT-5 | 0.7873 ± 0.0789 | 0.4762 ± 0.0206 | 0.5931 ± 0.0380 | 0.2127 ± 0.0789 | 3 |
| GPT-5.1 | 0.7631 ± 0.0535 | 0.7262 ± 0.0412 | 0.7441 ± 0.0466 | 0.2369 ± 0.0535 | 3 |
| GPT-5.2 | 0.7563 ± 0.0766 | 0.6905 ± 0.0743 | 0.7203 ± 0.0649 | 0.2437 ± 0.0766 | 3 |

Table 2: RAG results
Table 2 shows that GPT-5.1 and GPT-5.2 are effectively tied here. Note again that the reported standard deviations reflect noise/instability from the LLM-as-a-judge evaluation, not model randomness, so GPT-5.1 and GPT-5.2 should be treated as essentially equivalent in performance.
Comparing GPT-5.2 with its “no-RAG” results, precision drops from 0.95 to 0.76 while recall jumps from 0.36 to 0.69, which drives an overall F1 gain from 0.52 to 0.72.
In the RAG context here, GPT-5.2 produces more false positives/unsupported assertions. This suggests that retrieval injects extra candidate targets and the model over-uses them or over-generalises.
The “Hallucination Rate” here is just 1 − precision, so it does not give an independent “unsupported by retrieved context” metric; it simply reports the share of predicted targets that were wrong.
We can see that this simple RAG setup boosts overall performance considerably, but it is not perfect out of the box and will take more work to reach high performance. As stated, every component in Diagram 1 affects the system's performance, and the way to drive improvements is through experimentation against the benchmarks established here.
Next Steps #
I glossed over exactly why and how the precision dropped. I will dig into the results, figure out what happened, and work out how to mitigate it.
I am also now refactoring the code to experiment with Agentic RAG pipeline ideas, such as those outlined here and here, among other possible methods. Doing this in a targeted way, identifying weaknesses and potential solutions and testing them through experimentation, will form my approach.
Repository #
You can find the code here: