A Benchmark for Financial Document Comprehension
FinancialTouchstone is a question answering dataset designed to evaluate AI models on their ability to extract and comprehend information from real-world financial documents, including annual reports, earnings releases, and regulatory filings.
Dataset v1.2
Questions are derived from actual annual reports and financial filings from publicly traded companies across multiple industries.
All questions and answers are carefully curated and verified by financial domain experts to ensure accuracy and relevance.
Covers key financials, cash flow analysis, revenue breakdown, segment reporting, and company classification questions.
The benchmark evaluates models on six question types, derived from the questions most frequently asked by professional equity analysts.
Extract key financial metrics such as net income, EBITDA, earnings per share, profit margins, dividends, and other industry-specific KPIs from the annual report.
Extract cash flow figures including operating, investing, and financing activities, as well as free cash flow and total net change in cash.
Find and extract the company's revenue figures, accounting for industry-specific terminology (e.g., "premiums earned" in insurance, "net interest income" in banking).
Analyze year-over-year revenue growth, including reported growth rates or current and prior year revenues for comparison.
Identify and describe the company's business segments or divisions, including segment-specific revenues, growth rates, and operational details.
Determine the legal form and corporate structure of the company (e.g., Inc., LLC, GmbH, AG, plc) based on the provided context.
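The six question types above can be represented as a small lookup table, keyed the way the dataset's `question_type` field is. Only `cash_flow` is confirmed by the example entry below; the other key names are assumptions for illustration.

```python
# Question-type taxonomy as a dict; keys other than "cash_flow"
# (the only one seen in the dataset example) are assumed names.
QUESTION_TYPES = {
    "key_financials": "Net income, EBITDA, EPS, margins, dividends, KPIs",
    "cash_flow": "Operating, investing, financing, free cash flow, net change",
    "revenue": "Revenue figures, incl. industry-specific terminology",
    "revenue_growth": "Year-over-year growth or current vs. prior revenues",
    "segments": "Business segments, segment revenues and growth",
    "legal_form": "Legal form and corporate structure (Inc., LLC, GmbH, AG, plc)",
}

print(len(QUESTION_TYPES))  # 6
```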
Models must answer questions based only on the provided context from annual reports.
Here's a complete example showing how a question is answered using context from an annual report:
Company: Nestle S.A.
Report: Annual Report 2021
Country: Switzerland (CH)
Industry: Consumer Staples
"What information is provided about the company's cash flow?"
Context (page 3):
Operating cash flow (in CHF): 13.9 billion (42.1% of net financial debt)
Free cash flow* (in CHF): 8.7 billion
Golden answer: 13.9bn CHF operating cash flow, 8.7bn CHF free cash flow
Each entry in the dataset follows this JSON structure (available in Excel format as well):
{
"unique_key": "ID_000001_cash_flow",
"report_id": "ID_000001",
"company_name": "Nestle",
"year": 2021,
"country": "CH",
"industry": "Consumer Staples",
"question_type": "cash_flow",
"golden_answer": "13.9bn CHF, 8.7bn free cash flow",
"golden_context": "page 3: Operating cash flow (in CHF)..."
}
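Since only records with complete golden answers and context are released, a consumer of the dataset can sanity-check each entry against the schema above. A minimal sketch; the field names come from the schema, and the example record is the Nestle entry shown above.

```python
# Fields every dataset record is expected to carry, per the schema above.
REQUIRED_FIELDS = {
    "unique_key", "report_id", "company_name", "year",
    "country", "industry", "question_type",
    "golden_answer", "golden_context",
}

def validate_record(record: dict) -> bool:
    """Return True if every required field is present and non-empty."""
    return all(record.get(f) not in (None, "") for f in REQUIRED_FIELDS)

record = {
    "unique_key": "ID_000001_cash_flow",
    "report_id": "ID_000001",
    "company_name": "Nestle",
    "year": 2021,
    "country": "CH",
    "industry": "Consumer Staples",
    "question_type": "cash_flow",
    "golden_answer": "13.9bn CHF, 8.7bn free cash flow",
    "golden_context": "page 3: Operating cash flow (in CHF)...",
}

print(validate_record(record))  # True
```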
We use an LLM-as-a-Judge methodology for automated, scalable evaluation.
Model responses are evaluated using automated LLM-based grading (GPT-5 and o3). The evaluation prompts were iteratively refined until no systematic errors remained, with an expected inherent error rate of approximately 2% due to edge cases.
Measures whether the golden answer facts were successfully mentioned in the model's response. An answer is correct if every verifiable claim is factually accurate according to the retrieved context and addresses the core requirement.
A hallucination is any factual claim in the model's answer that cannot be verified using the retrieved context. Extra information that IS present in the context is NOT a hallucination. Lower is better.
Tracks whether errors stem from the retrieval system failing to provide adequate context. When retrieval fails, model accuracy drops to 0.2%, demonstrating that retrieval is the primary bottleneck.
Check each factual claim in the model's answer against the Retrieved Context. The Retrieved Context is the single source of truth.
Check for factual errors, then verify relevance and completeness against the golden answer.
If incorrect, determine whether the retriever was insufficient (missing necessary information from the context).
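The three grading steps above can be sketched as a small pipeline. This is a schematic only: `judge(prompt) -> bool` stands in for an LLM-as-a-Judge call (e.g. to GPT-5 or o3), and the prompt wording is an assumption, not the official grading prompt.

```python
from dataclasses import dataclass

@dataclass
class Grade:
    hallucination: bool    # any claim unverifiable from the retrieved context
    correct: bool          # addresses the golden answer's core requirement
    retriever_error: bool  # failure attributable to missing context

def grade(answer: str, retrieved_context: str, golden_answer: str,
          judge) -> Grade:
    """Apply the three grading steps; `judge` is a placeholder LLM call."""
    # Step 1: the Retrieved Context is the single source of truth.
    hallucination = judge(
        f"Does the answer make any claim not verifiable from this context?\n"
        f"Context: {retrieved_context}\nAnswer: {answer}")
    # Step 2: correctness, relevance, and completeness vs the golden answer.
    correct = (not hallucination) and judge(
        f"Does the answer address the core requirement of the golden answer?\n"
        f"Golden: {golden_answer}\nAnswer: {answer}")
    # Step 3: if incorrect, attribute the error to retrieval when the
    # necessary information never reached the model.
    retriever_error = (not correct) and judge(
        f"Is the information needed for the golden answer missing from the "
        f"context?\nGolden: {golden_answer}\nContext: {retrieved_context}")
    return Grade(hallucination, correct, retriever_error)
```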
The evaluation framework supports three grading modes.
Get the complete FinancialTouchstone dataset in Excel format.
Download Excel (v1.2): 2,788 questions with golden answers and context
Get the FinancialTouchstone dataset in JSON format.
Download JSON (v1.2): the same data in JSON format with train/dev/test splits
Official prompts for querying models and LLM-based answer grading.
Download Prompts: 6 question types + 3 grading modes (JSON)
Access the original 470 annual reports used in the benchmark.
Open Google Drive: PDF + OCR text for all 470 documents
The dataset contains 2,788 expert-verified questions derived from 470 annual reports. Only records with complete golden answers and supporting context are included. Records with missing or incomplete data are excluded from the public release to ensure evaluation quality.
Model performance on the FinancialTouchstone test set (with retriever errors)
| Rank | Model | Acc. | Hall. | Organization | Submitted |
|---|---|---|---|---|---|
| Baseline | Human Expert | 84.8% | 2.8% | DS-NLP, University of St. Gallen | 2025-01-15 |
| 1 | GraphRAG (text-embedding-3-small) + Gemini 2.5 Pro | 68.3% | 9.3% | DS-NLP, University of St. Gallen | 2025-07-01 |
| 2 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Pro | 66.2% | 3.8% | Google DeepMind | 2025-07-01 |
| 3 | Vanilla RAG (text-embedding-3-small) + o3 | 64.6% | 5.9% | OpenAI | 2025-07-01 |
| 4 | Vanilla RAG (text-embedding-3-small) + Claude Sonnet 4 | 64.6% | 6.6% | Anthropic | 2025-07-01 |
| 5 | Vanilla RAG (text-embedding-3-small) + Grok 4 | 59.1% | 10.5% | xAI | 2025-07-01 |
| 6 | Vanilla RAG (text-embedding-3-small) + DeepSeek R1 | 58.1% | 11.8% | DeepSeek | 2025-07-01 |
| 7 | Vanilla RAG (text-embedding-3-small) + Claude Opus 4 | 57.4% | 7.3% | Anthropic | 2025-07-01 |
| 8 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Flash | 54.1% | 11.8% | Google DeepMind | 2025-07-01 |
| 9 | Vanilla RAG (text-embedding-3-small) + o4-mini | 50.2% | 13.8% | OpenAI | 2025-07-01 |
| 10 | Vanilla RAG (text-embedding-3-small) + DeepSeek V3.1 | 47.6% | 14.6% | DeepSeek | 2025-07-01 |
| 11 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Flash-Lite | 39.2% | 14.6% | Google DeepMind | 2025-07-01 |
| 12 | Vanilla RAG (text-embedding-3-small) + GPT-4o | 33.3% | 16.3% | OpenAI | 2025-07-01 |
This leaderboard shows overall system performance including retrieval errors. When the retriever fails to fetch relevant context, model accuracy drops significantly.
Model performance excluding retriever errors (correct context provided)
| Rank | Model | Acc. | Hall. | Organization | Submitted |
|---|---|---|---|---|---|
| Baseline | Human Expert | 84.8% | 2.8% | DS-NLP, University of St. Gallen | 2025-01-15 |
| 1 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Pro | 91.6% | 3.2% | Google DeepMind | 2025-07-01 |
| 2 | Vanilla RAG (text-embedding-3-small) + o3 | 89.2% | 6.9% | OpenAI | 2025-07-01 |
| 3 | Vanilla RAG (text-embedding-3-small) + Claude Sonnet 4 | 89.0% | 7.2% | Anthropic | 2025-07-01 |
| 4 | Vanilla RAG (text-embedding-3-small) + Claude Opus 4 | 81.7% | 7.5% | Anthropic | 2025-07-01 |
| 5 | Vanilla RAG (text-embedding-3-small) + Grok 4 | 77.4% | 11.4% | xAI | 2025-07-01 |
| 6 | Vanilla RAG (text-embedding-3-small) + DeepSeek R1 | 75.6% | 13.0% | DeepSeek | 2025-07-01 |
| 7 | Vanilla RAG (text-embedding-3-small) + o4-mini | 75.1% | 13.2% | OpenAI | 2025-07-01 |
| 8 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Flash | 73.6% | 12.6% | Google DeepMind | 2025-07-01 |
| 9 | Vanilla RAG (text-embedding-3-small) + DeepSeek V3.1 | 63.1% | 16.1% | DeepSeek | 2025-07-01 |
| 10 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Flash-Lite | 63.1% | 16.8% | Google DeepMind | 2025-07-01 |
| 11 | Vanilla RAG (text-embedding-3-small) + GPT-4o | 50.9% | 32.6% | OpenAI | 2025-07-01 |
This leaderboard shows pure language model comprehension performance when given the correct context. It excludes cases where the retriever failed to fetch relevant information, isolating the LLM's ability to understand and extract information from financial documents.
Want to add your model to the leaderboard?
Results can be submitted by sending the paper, and optionally the code, to the authors at siegfried.handschuh@unisg.ch.
FinancialTouchstone is developed and maintained by the DS-NLP team at the University of St. Gallen.
Chair of Data Science & NLP
Institute of Computer Science, University of St. Gallen
Researcher
Institute of Computer Science, University of St. Gallen
PhD Researcher
Institute of Computer Science, University of St. Gallen
FinancialTouchstone is the largest open benchmark for financial document comprehension, containing 2,788 question-answer pairs from 470 annual reports across 22 countries. Unlike prior benchmarks that focused primarily on US reports, our dataset covers global markets and diverse regulatory environments. The documents contain over 83 million tokens of high-quality, manually-written financial text.
Each entry contains a question, the golden answer, the supporting context from the annual report, and metadata about the source document. The dataset includes six question types covering key financials, cash flow, revenue, revenue growth, business segments, and company legal form.
The first leaderboard shows overall system performance including retrieval errors; this represents real-world RAG pipeline performance. The second leaderboard shows pure LLM performance when given the correct context, isolating the model's comprehension ability from retrieval quality. Our research shows that 66.5% of errors stem from retrieval failures, not model limitations.
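The relationship between the two leaderboards is simple to compute from per-question results: overall accuracy counts every question, while the second metric conditions on cases where retrieval succeeded. A minimal sketch; the per-result field names are illustrative, not the official evaluation output format.

```python
def leaderboard_metrics(results):
    """results: list of dicts with boolean 'correct' and 'retriever_error'.

    Returns (overall_accuracy, accuracy_excluding_retriever_errors),
    mirroring the two leaderboards.
    """
    overall = sum(r["correct"] for r in results) / len(results)
    retrieval_ok = [r for r in results if not r["retriever_error"]]
    excluding = (sum(r["correct"] for r in retrieval_ok) / len(retrieval_ok)
                 if retrieval_ok else 0.0)
    return overall, excluding

# Toy run: 4 questions, 1 retrieval failure.
sample = [
    {"correct": True,  "retriever_error": False},
    {"correct": False, "retriever_error": True},
    {"correct": True,  "retriever_error": False},
    {"correct": False, "retriever_error": False},
]
print(leaderboard_metrics(sample))  # (0.5, 0.6666666666666666)
```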
We use an LLM-as-a-Judge approach with automated evaluation using frontier models (GPT-5 and o3). The evaluation checks for hallucinations (claims not verifiable from context) and correctness (whether the answer addresses the golden answer requirements). The evaluation prompts were iteratively refined to minimize systematic errors.
Human annotators achieved 84.8% accuracy with a 2.8% hallucination rate. Interestingly, some top models (like Gemini 2.5 Pro at 91.6% accuracy) exceed human accuracy, though no model yet matches the human hallucination rate.
The dataset is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0), allowing use for research and commercial purposes with proper attribution.
Send your paper and optionally your code to siegfried.handschuh@unisg.ch. We will evaluate your submission and add qualifying results to the leaderboard.
Yes, all 470 annual reports are available in both PDF and OCR'd text format via Google Drive. The documents range from 50 to over 1,700 pages.
If you use FinancialTouchstone in your research, please cite our paper:
@inbook{10.1145/3768292.3770417,
author = {Sp\"{o}rer, Jan},
title = {Can AI Read Like a Financial Analyst? A Financial Touchstone for Frontier Language Models Such as Gemini 2.5 Pro, o3, and Grok 4 on Long-Context Annual Report Comprehension},
year = {2025},
isbn = {9798400722202},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3768292.3770417},
booktitle = {Proceedings of the 6th ACM International Conference on AI in Finance},
pages = {291--298},
numpages = {8}
}
Paper: Can AI Read Like a Financial Analyst? A Financial Touchstone for Frontier Language Models Such as Gemini 2.5 Pro, o3, and Grok 4 on Long-Context Annual Report Comprehension
Venue: 6th ACM International Conference on AI in Finance (ICAIF '25), November 2025, Singapore