FinancialTouchstone

A Benchmark for Financial Document Comprehension

FinancialTouchstone is a question answering dataset designed to evaluate AI models on their ability to extract and comprehend information from real-world financial documents, including annual reports, earnings releases, and regulatory filings.

2,788 Questions
470 Documents
6 Question Types

Dataset v1.2

What is FinancialTouchstone?

Real Financial Documents

Questions are derived from actual annual reports and financial filings from publicly traded companies across multiple industries.

Expert Annotations

All questions and answers are carefully curated and verified by financial domain experts to ensure accuracy and relevance.

Comprehensive Evaluation

Covers key financials, cash flow analysis, revenue breakdown, segment reporting, and company classification questions.

Tasks

The benchmark evaluates models on six question types derived from the questions most frequently asked by professional equity analysts.

1

Key Financials

Extract key financial metrics such as net income, EBITDA, earnings per share, profit margins, dividends, and other industry-specific KPIs from the annual report.

2

Cash Flow

Extract cash flow figures including operating, investing, and financing activities, as well as free cash flow and total net change in cash.

3

Revenue

Find and extract the company's revenue figures, accounting for industry-specific terminology (e.g., "premiums earned" in insurance, "net interest income" in banking).

4

Revenue Growth

Analyze year-over-year revenue growth, including reported growth rates or current and prior year revenues for comparison.

5

Business Segments

Identify and describe the company's business segments or divisions, including segment-specific revenues, growth rates, and operational details.

6

Company Type / Legal Form

Determine the legal form and corporate structure of the company (e.g., Inc., LLC, GmbH, AG, plc) based on the provided context.

Task Requirements

Models must answer questions based only on the provided context from annual reports. Key requirements include:

  • Ground all claims in the provided text; no external knowledge
  • Do not invent, calculate, or estimate figures not explicitly stated
  • If the context lacks the answer, state that the information is not available
  • Quote directly or cite information accurately where possible
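These requirements can be encoded directly in the prompt given to the model under evaluation. The sketch below is illustrative only; the official query prompts are distributed in the Evaluation Prompts download, and the wording here is a paraphrase of the rules above, not the official text.

```python
# Illustrative sketch: turning the task requirements into a chat-style
# request. The prompt wording is a paraphrase, not the official prompt.

SYSTEM_PROMPT = """You are answering questions about an annual report.
Rules:
- Ground every claim in the provided context; use no external knowledge.
- Do not invent, calculate, or estimate figures not explicitly stated.
- If the context lacks the answer, state that the information is not available.
- Quote directly or cite the context where possible."""

def build_messages(question: str, context: str) -> list[dict]:
    """Assemble a chat-style request from a question and retrieved context."""
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]

msgs = build_messages(
    "What information is provided about the company's cash flow?",
    "Operating cash flow (in CHF): 13.9 billion ...",
)
```

The context is placed before the question so the model reads the grounding material first; any chat-completion API that accepts system/user messages can consume this structure.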

Example: Cash Flow Question

Here's a complete example showing how a question is answered using context from an annual report:

1 Source Document

Company: Nestle S.A.

Report: Annual Report 2021

Country: Switzerland (CH)

Industry: Consumer Staples

2 Question

"What information is provided about the company's cash flow?"

3 Retrieved Context (from PDF page 3)

Operating cash flow (in CHF): 13.9 billion (42.1% of net financial debt)
Free cash flow* (in CHF): 8.7 billion

4 Golden Answer

13.9bn CHF operating cash flow, 8.7bn CHF free cash flow

Data Schema

Each entry in the dataset follows this JSON structure (available in Excel format as well):

{
  "unique_key": "ID_000001_cash_flow",
  "report_id": "ID_000001",
  "company_name": "Nestle",
  "year": 2021,
  "country": "CH",
  "industry": "Consumer Staples",
  "question_type": "cash_flow",
  "golden_answer": "13.9bn CHF operating cash flow, 8.7bn CHF free cash flow",
  "golden_context": "page 3: Operating cash flow (in CHF)..."
}
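A minimal loading sketch, assuming the JSON release is an array of such entry objects: parse the records and group them by `question_type`. The field names follow the published schema; the sample record is the one shown above.

```python
import json
from collections import defaultdict

# Sample record from the Data Schema section, embedded as a JSON string
# so the sketch is self-contained (the real release is a file of such
# entries, assumed here to be a top-level JSON array).
SAMPLE = """[
  {"unique_key": "ID_000001_cash_flow", "report_id": "ID_000001",
   "company_name": "Nestle", "year": 2021, "country": "CH",
   "industry": "Consumer Staples", "question_type": "cash_flow",
   "golden_answer": "13.9bn CHF operating cash flow, 8.7bn CHF free cash flow",
   "golden_context": "page 3: Operating cash flow (in CHF)..."}
]"""

def group_by_type(raw_json: str) -> dict[str, list[dict]]:
    """Parse dataset entries and bucket them by their question_type field."""
    entries = json.loads(raw_json)
    by_type: dict[str, list[dict]] = defaultdict(list)
    for e in entries:
        by_type[e["question_type"]].append(e)
    return dict(by_type)

grouped = group_by_type(SAMPLE)
```

Grouping by `question_type` makes it easy to report per-task scores alongside the overall accuracy, mirroring the six task categories above.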

Evaluation

We use an LLM-as-a-Judge methodology for automated, scalable evaluation

LLM-as-a-Judge Approach

Model responses are evaluated using automated LLM-based grading (GPT-5 and o3). The evaluation prompts were iteratively refined until no systematic errors remained, with an expected inherent error rate of approximately 2% due to edge cases.

Accuracy (Recall)

Measures whether the golden answer facts were successfully mentioned in the model's response. An answer is correct if every verifiable claim is factually accurate according to the retrieved context and addresses the core requirement.

Hallucination Rate (Inverse Precision)

A hallucination is any factual claim in the model's answer that cannot be verified using the retrieved context. Extra information that IS present in the context is NOT a hallucination. Lower is better.

Retriever Insufficiency

Tracks whether errors stem from the retrieval system failing to provide adequate context. When retrieval fails, model accuracy drops to 0.2%, demonstrating that retrieval is the primary bottleneck.

Evaluation Procedure

1

Hallucination Verification

Check each factual claim in the model's answer against the Retrieved Context. The Retrieved Context is the single source of truth.

2

Correctness Assessment

Check for factual errors, then verify relevance and completeness against the golden answer.

3

Retriever Assessment

If incorrect, determine whether the retriever was insufficient (missing necessary information from the context).
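The control flow of step 1 can be sketched with a toy verifier that flags a claim as unsupported when its key figures do not appear in the retrieved context. This is deliberately simplistic: the real benchmark uses an LLM judge, not string matching, and the helper names here are hypothetical.

```python
import re

# Toy stand-in for hallucination verification (step 1): a claim counts as
# supported only if every numeric figure it cites appears in the retrieved
# context. The actual benchmark delegates this judgment to an LLM judge.

def extract_figures(text: str) -> set[str]:
    """Pull numeric tokens (e.g. '13.9', '42.1%') out of a piece of text."""
    return set(re.findall(r"\d+(?:\.\d+)?%?", text))

def unsupported_claims(claims: list[str], context: str) -> list[str]:
    """Return the claims whose figures are not all present in the context."""
    ctx_figures = extract_figures(context)
    return [c for c in claims if not extract_figures(c) <= ctx_figures]

context = "Operating cash flow (in CHF): 13.9 billion. Free cash flow: 8.7 billion."
claims = [
    "Operating cash flow was 13.9 billion CHF",
    "Net income was 16.9 billion CHF",  # figure absent from the context
]
bad = unsupported_claims(claims, context)  # flags only the second claim
```

An LLM judge generalizes this beyond numbers to paraphrases and qualitative claims, but the decision structure is the same: the Retrieved Context is the single source of truth.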

Grading Modes

The evaluation framework supports three grading modes:

  • Default: Standard evaluation against the gold standard; recommended for benchmark evaluation
  • Strict: Rigorous evaluation with high accuracy requirements; penalizes numerical discrepancies and missing elements
  • Lenient: Flexible evaluation focusing on core concepts; accepts reasonable approximations and gives credit for partial understanding
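Selecting among the three modes amounts to choosing a grading prompt. A minimal sketch, with placeholder prompt texts (the official ones ship in the Evaluation Prompts download):

```python
# Hypothetical mode-to-prompt lookup; the prompt strings are placeholders
# summarizing each mode, not the official grading prompts.
GRADING_PROMPTS = {
    "default": "Grade the answer against the gold standard.",
    "strict": "Penalize any numerical discrepancy or missing element.",
    "lenient": "Accept reasonable approximations of the core concepts.",
}

def grading_prompt(mode: str = "default") -> str:
    """Return the grading prompt for a mode; reject unknown modes early."""
    if mode not in GRADING_PROMPTS:
        raise ValueError(f"unknown grading mode: {mode}")
    return GRADING_PROMPTS[mode]
```

Failing fast on an unknown mode avoids silently grading a whole run with the wrong rubric.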

Getting Started

Download Dataset (Excel)

Get the complete FinancialTouchstone dataset in Excel format.

Download Excel (v1.2)

2,788 questions with golden answers and context

Download Dataset (JSON)

Get the FinancialTouchstone dataset in JSON format.

Download JSON (v1.2)

Same data in JSON format with train/dev/test splits

Evaluation Prompts

Official prompts for querying models and LLM-based answer grading.

Download Prompts

6 question types + 3 grading modes (JSON)

Annual Report PDFs

Access the original 470 annual reports used in the benchmark.

Open Google Drive

PDF + OCR text for all 470 documents

What's Included in v1.2

The dataset contains 2,788 expert-verified questions derived from 470 annual reports. Only records with complete golden answers and supporting context are included. Records with missing or incomplete data are excluded from the public release to ensure evaluation quality.

  • Included: Questions with both golden_answer AND golden_context populated
  • Excluded: 770 incomplete records (missing context or answer)
  • Quality: 115 errors identified and corrected (4.1% error rate on populated data)

Leaderboard

Model performance on the FinancialTouchstone test set (including retriever errors)

| Rank | Model | Acc. | Hall. | Organization | Submitted |
|---|---|---|---|---|---|
| Baseline | Human Expert | 84.8% | 2.8% | DS-NLP, University of St. Gallen | 2025-01-15 |
| 1 | GraphRAG (text-embedding-3-small) + Gemini 2.5 Pro | 68.3% | 9.3% | DS-NLP, University of St. Gallen | 2025-07-01 |
| 2 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Pro | 66.2% | 3.8% | Google DeepMind | 2025-07-01 |
| 3 | Vanilla RAG (text-embedding-3-small) + o3 | 64.6% | 5.9% | OpenAI | 2025-07-01 |
| 4 | Vanilla RAG (text-embedding-3-small) + Claude Sonnet 4 | 64.6% | 6.6% | Anthropic | 2025-07-01 |
| 5 | Vanilla RAG (text-embedding-3-small) + Grok 4 | 59.1% | 10.5% | xAI | 2025-07-01 |
| 6 | Vanilla RAG (text-embedding-3-small) + DeepSeek R1 | 58.1% | 11.8% | DeepSeek | 2025-07-01 |
| 7 | Vanilla RAG (text-embedding-3-small) + Claude Opus 4 | 57.4% | 7.3% | Anthropic | 2025-07-01 |
| 8 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Flash | 54.1% | 11.8% | Google DeepMind | 2025-07-01 |
| 9 | Vanilla RAG (text-embedding-3-small) + o4-mini | 50.2% | 13.8% | OpenAI | 2025-07-01 |
| 10 | Vanilla RAG (text-embedding-3-small) + DeepSeek V3.1 | 47.6% | 14.6% | DeepSeek | 2025-07-01 |
| 11 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Flash-Lite | 39.2% | 14.6% | Google DeepMind | 2025-07-01 |
| 12 | Vanilla RAG (text-embedding-3-small) + GPT-4o | 33.3% | 16.3% | OpenAI | 2025-07-01 |

Metrics

  • Accuracy (Acc.): Percentage of questions answered correctly based on the retrieved context.
  • Hallucination Rate (Hall.): Percentage of answers containing information not supported by the source documents (lower is better).
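Given per-question judge verdicts, the two leaderboard metrics reduce to simple averages. A sketch, assuming hypothetical verdict field names (`correct`, `hallucinated`) rather than the official grading-output schema:

```python
# Aggregate per-question judge verdicts into the two leaderboard metrics.
# The verdict dict fields are hypothetical names for illustration.

def score(verdicts: list[dict]) -> tuple[float, float]:
    """Return (accuracy, hallucination_rate) over a list of verdicts."""
    n = len(verdicts)
    accuracy = sum(v["correct"] for v in verdicts) / n
    hallucination_rate = sum(v["hallucinated"] for v in verdicts) / n
    return accuracy, hallucination_rate

verdicts = [
    {"correct": True,  "hallucinated": False},
    {"correct": True,  "hallucinated": False},
    {"correct": False, "hallucinated": True},
    {"correct": True,  "hallucinated": False},
]
acc, hall = score(verdicts)  # acc = 0.75, hall = 0.25
```

Note that the two metrics are independent: an answer can be incorrect without hallucinating (e.g. incomplete), and in principle correct on its core claim while still adding an unverifiable one.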

This leaderboard shows overall system performance including retrieval errors. When the retriever fails to fetch relevant context, model accuracy drops significantly.

Pure LLM Performance

Model performance excluding retriever errors (correct context provided)

| Rank | Model | Acc. | Hall. | Organization | Submitted |
|---|---|---|---|---|---|
| Baseline | Human Expert | 84.8% | 2.8% | DS-NLP, University of St. Gallen | 2025-01-15 |
| 1 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Pro | 91.6% | 3.2% | Google DeepMind | 2025-07-01 |
| 2 | Vanilla RAG (text-embedding-3-small) + o3 | 89.2% | 6.9% | OpenAI | 2025-07-01 |
| 3 | Vanilla RAG (text-embedding-3-small) + Claude Sonnet 4 | 89.0% | 7.2% | Anthropic | 2025-07-01 |
| 4 | Vanilla RAG (text-embedding-3-small) + Claude Opus 4 | 81.7% | 7.5% | Anthropic | 2025-07-01 |
| 5 | Vanilla RAG (text-embedding-3-small) + Grok 4 | 77.4% | 11.4% | xAI | 2025-07-01 |
| 6 | Vanilla RAG (text-embedding-3-small) + DeepSeek R1 | 75.6% | 13.0% | DeepSeek | 2025-07-01 |
| 7 | Vanilla RAG (text-embedding-3-small) + o4-mini | 75.1% | 13.2% | OpenAI | 2025-07-01 |
| 8 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Flash | 73.6% | 12.6% | Google DeepMind | 2025-07-01 |
| 9 | Vanilla RAG (text-embedding-3-small) + DeepSeek V3.1 | 63.1% | 16.1% | DeepSeek | 2025-07-01 |
| 10 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Flash-Lite | 63.1% | 16.8% | Google DeepMind | 2025-07-01 |
| 11 | Vanilla RAG (text-embedding-3-small) + GPT-4o | 50.9% | 32.6% | OpenAI | 2025-07-01 |

About This Leaderboard

This leaderboard shows pure language model comprehension performance when given the correct context. It excludes cases where the retriever failed to fetch relevant information, isolating the LLM's ability to understand and extract information from financial documents.

Submit Your Results

Want to add your model to the leaderboard?

Results can be submitted by emailing the paper, and optionally the code, to the authors at siegfried.handschuh@unisg.ch.

Organizers

FinancialTouchstone is developed and maintained by the DS-NLP team at the University of St. Gallen

Prof. Dr. Siegfried Handschuh

Chair of Data Science & NLP

Institute of Computer Science, University of St. Gallen

siegfried.handschuh@unisg.ch

Michael Gaus

Researcher

Institute of Computer Science, University of St. Gallen

michaelmarkus.gaus@unisg.ch

Jan Spörer

PhD Researcher

Institute of Computer Science, University of St. Gallen

jan.spoerer@unisg.ch

Frequently Asked Questions

What makes FinancialTouchstone different from other financial QA benchmarks?

FinancialTouchstone is the largest open benchmark for financial document comprehension, containing 2,788 question-answer pairs from 470 annual reports across 22 countries. Unlike prior benchmarks that focused primarily on US reports, our dataset covers global markets and diverse regulatory environments. The documents contain over 83 million tokens of high-quality, manually written financial text.

How is the dataset structured?

Each entry contains a question, the golden answer, the supporting context from the annual report, and metadata about the source document. The dataset includes six question types covering key financials, cash flow, revenue, revenue growth, business segments, and company legal form.

Why are there two leaderboards?

The first leaderboard shows overall system performance including retrieval errors; this represents real-world RAG pipeline performance. The second leaderboard shows pure LLM performance when given the correct context, isolating the model's comprehension ability from retrieval quality. Our research shows that 66.5% of errors stem from retrieval failures, not model limitations.

How is evaluation performed?

We use an LLM-as-a-Judge approach with automated evaluation using frontier models (GPT-5 and o3). The evaluation checks for hallucinations (claims not verifiable from context) and correctness (whether the answer addresses the golden answer requirements). The evaluation prompts were iteratively refined to minimize systematic errors.

What is the human baseline performance?

Human annotators achieved 84.8% accuracy with a 2.8% hallucination rate. Interestingly, some top models (like Gemini 2.5 Pro at 91.6% accuracy) exceed human accuracy, though no model yet matches the human hallucination rate.

Can I use this dataset for commercial purposes?

The dataset is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0), allowing use for research and commercial purposes with proper attribution.

How can I submit my results?

Send your paper and optionally your code to siegfried.handschuh@unisg.ch. We will evaluate your submission and add qualifying results to the leaderboard.

Are the annual report PDFs available?

Yes, all 470 annual reports are available in both PDF and OCR'd text format via Google Drive. The documents range from 50 to over 1,700 pages.

Citation

If you use FinancialTouchstone in your research, please cite our paper:

BibTeX

@inbook{10.1145/3768292.3770417,
  author = {Sp\"{o}rer, Jan},
  title = {Can AI Read Like a Financial Analyst? A Financial Touchstone for Frontier Language Models Such as Gemini 2.5 Pro, o3, and Grok 4 on Long-Context Annual Report Comprehension},
  year = {2025},
  isbn = {9798400722202},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3768292.3770417},
  booktitle = {Proceedings of the 6th ACM International Conference on AI in Finance},
  pages = {291--298},
  numpages = {8}
}

Paper: Can AI Read Like a Financial Analyst? A Financial Touchstone for Frontier Language Models Such as Gemini 2.5 Pro, o3, and Grok 4 on Long-Context Annual Report Comprehension

Venue: 6th ACM International Conference on AI in Finance (ICAIF '25), November 2025, Singapore

DOI: 10.1145/3768292.3770417