FinancialTouchstone

A Benchmark for Financial Document Comprehension

FinancialTouchstone is a question answering dataset designed to evaluate AI models on their ability to extract and comprehend information from real-world financial documents, including annual reports, earnings releases, and regulatory filings.

2,788 Questions
470 Documents
6 Question Types

Dataset v1.2

What is FinancialTouchstone?

Real Financial Documents

Questions are derived from actual annual reports and financial filings from publicly traded companies across multiple industries.

Expert Annotations

All questions and answers are carefully curated and verified by financial domain experts to ensure accuracy and relevance.

Comprehensive Evaluation

Covers key financials, cash flow analysis, revenue breakdown, segment reporting, and company classification questions.

Getting Started

Download Dataset (Excel)

Get the complete FinancialTouchstone dataset in Excel format.

Download Excel (v1.2)

2,788 questions with golden answers and context
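
As a quick sanity check after downloading, the spreadsheet can be loaded with pandas. This is a minimal sketch only: the file name and the column names used below (question, golden_answer, golden_context) are assumptions for illustration and may differ from the actual release.

```python
# Minimal sketch for inspecting the Excel release with pandas.
# The file name and column names (question, golden_answer, golden_context)
# are assumptions; check the downloaded file for the actual schema.
import pandas as pd

df = pd.read_excel("financial_touchstone_v1.2.xlsx")

print(df.shape)               # expected: (2788, n_columns)
print(df.columns.tolist())    # confirm the real column names here

# Peek at one record, assuming these column names exist.
row = df.iloc[0]
print(row.get("question"))
print(row.get("golden_answer"))
```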

Download Dataset (JSON)

Get the FinancialTouchstone dataset in JSON format.

Download JSON (v1.2)

Same data in JSON format with train/dev/test splits
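
A minimal loading sketch, assuming the JSON release is a single file keyed by split ("train", "dev", "test") with one dictionary per question. The file name and field names are assumptions and may not match the official schema.

```python
# Minimal sketch for loading the JSON release, assuming one file keyed by
# split. Field names (question, golden_answer, golden_context) are
# illustrative; adjust to the actual schema in the download.
import json

with open("financial_touchstone_v1.2.json", encoding="utf-8") as f:
    data = json.load(f)

for split in ("train", "dev", "test"):
    records = data.get(split, [])
    print(f"{split}: {len(records)} questions")

# Inspect one test example (assumed field names).
example = data["test"][0]
print(example["question"])
print(example["golden_answer"])
```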

Evaluation Prompts

Official prompts for querying models and LLM-based answer grading.

Download Prompts

6 question types + 3 grading modes (JSON)
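
A hypothetical usage sketch: it assumes the prompts file maps each question type to a template with {context} and {question} placeholders. The key names, placeholder names, and the question-type label shown here are illustrative assumptions, not the official schema.

```python
# Hypothetical usage of the prompts file, assuming it maps each of the six
# question types to a template string with {context} and {question}
# placeholders. All key and placeholder names below are assumptions.
import json

with open("financial_touchstone_prompts.json", encoding="utf-8") as f:
    prompts = json.load(f)

def build_query(question_type: str, context: str, question: str) -> str:
    """Fill the template for one question type (assumed structure)."""
    template = prompts["question_types"][question_type]
    return template.format(context=context, question=question)

query = build_query(
    question_type="key_financials",              # assumed type name
    context="...retrieved report passage...",
    question="What was total revenue in FY2023?",
)
print(query)
```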

What's Included in v1.2

The dataset contains 2,788 expert-verified questions derived from 470 annual reports. Only records with complete golden answers and supporting context are included. Records with missing or incomplete data are excluded from the public release to ensure evaluation quality.

  • Included: Questions with both golden_answer AND golden_context populated (see the filter sketch after this list)
  • Excluded: 770 incomplete records (missing context or answer)
  • Quality: 115 errors identified and corrected (4.1% error rate on populated data)
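
The inclusion rule above amounts to a simple completeness filter. The sketch below is illustrative only; the field names golden_answer and golden_context mirror the list above, but the record structure is assumed.

```python
# Sketch of the inclusion rule: keep only records where both
# golden_answer and golden_context are populated. Field names are
# assumed for illustration.
def is_complete(record: dict) -> bool:
    answer = (record.get("golden_answer") or "").strip()
    context = (record.get("golden_context") or "").strip()
    return bool(answer) and bool(context)

def filter_records(records: list[dict]) -> list[dict]:
    return [r for r in records if is_complete(r)]
```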

Leaderboard

Model performance on the FinancialTouchstone test set, including retriever errors

Rank | Model | Acc. | Hall. | Organization | Submitted
Baseline | Human Expert | 84.8% | 2.8% | DS-NLP, University of St. Gallen | 2025-01-15
1 | GraphRAG (text-embedding-3-small) + Gemini 2.5 Pro | 68.3% | 9.3% | DS-NLP, University of St. Gallen | 2025-07-01
2 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Pro | 66.2% | 3.8% | Google DeepMind | 2025-07-01
3 | Vanilla RAG (text-embedding-3-small) + o3 | 64.6% | 5.9% | OpenAI | 2025-07-01
4 | Vanilla RAG (text-embedding-3-small) + Claude Sonnet 4 | 64.6% | 6.6% | Anthropic | 2025-07-01
5 | Vanilla RAG (text-embedding-3-small) + Grok 4 | 59.1% | 10.5% | xAI | 2025-07-01
6 | Vanilla RAG (text-embedding-3-small) + DeepSeek R1 | 58.1% | 11.8% | DeepSeek | 2025-07-01
7 | Vanilla RAG (text-embedding-3-small) + Claude Opus 4 | 57.4% | 7.3% | Anthropic | 2025-07-01
8 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Flash | 54.1% | 11.8% | Google DeepMind | 2025-07-01
9 | Vanilla RAG (text-embedding-3-small) + o4-mini | 50.2% | 13.8% | OpenAI | 2025-07-01
10 | Vanilla RAG (text-embedding-3-small) + DeepSeek V3.1 | 47.6% | 14.6% | DeepSeek | 2025-07-01
11 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Flash-Lite | 39.2% | 14.6% | Google DeepMind | 2025-07-01
12 | Vanilla RAG (text-embedding-3-small) + GPT-4o | 33.3% | 16.3% | OpenAI | 2025-07-01

Metrics

  • Accuracy (Acc.): Percentage of questions answered correctly based on the retrieved context.
  • Hallucination Rate (Hall.): Percentage of answers containing information not supported by the source documents (lower is better; a small computation sketch follows this list).
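
For concreteness, here is a minimal sketch of how the two rates could be computed from graded outputs. The boolean correct and hallucinated flags are assumed field names, not part of the official grading output.

```python
# Minimal sketch of the two leaderboard metrics, assuming each graded
# answer carries boolean `correct` and `hallucinated` flags produced by
# the grading step. These field names are assumptions.
def accuracy(results: list[dict]) -> float:
    return 100.0 * sum(r["correct"] for r in results) / len(results)

def hallucination_rate(results: list[dict]) -> float:
    return 100.0 * sum(r["hallucinated"] for r in results) / len(results)

# Example: 3 graded answers -> 66.7% accuracy, 33.3% hallucination rate.
graded = [
    {"correct": True,  "hallucinated": False},
    {"correct": True,  "hallucinated": False},
    {"correct": False, "hallucinated": True},
]
print(f"Acc.:  {accuracy(graded):.1f}%")
print(f"Hall.: {hallucination_rate(graded):.1f}%")
```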

This leaderboard shows overall system performance including retrieval errors. When the retriever fails to fetch relevant context, model accuracy drops significantly.

Pure LLM Performance

Model performance excluding retriever errors (correct context provided)

Rank | Model | Acc. | Hall. | Organization | Submitted
Baseline | Human Expert | 84.8% | 2.8% | DS-NLP, University of St. Gallen | 2025-01-15
1 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Pro | 91.6% | 3.2% | Google DeepMind | 2025-07-01
2 | Vanilla RAG (text-embedding-3-small) + o3 | 89.2% | 6.9% | OpenAI | 2025-07-01
3 | Vanilla RAG (text-embedding-3-small) + Claude Sonnet 4 | 89.0% | 7.2% | Anthropic | 2025-07-01
4 | Vanilla RAG (text-embedding-3-small) + Claude Opus 4 | 81.7% | 7.5% | Anthropic | 2025-07-01
5 | Vanilla RAG (text-embedding-3-small) + Grok 4 | 77.4% | 11.4% | xAI | 2025-07-01
6 | Vanilla RAG (text-embedding-3-small) + DeepSeek R1 | 75.6% | 13.0% | DeepSeek | 2025-07-01
7 | Vanilla RAG (text-embedding-3-small) + o4-mini | 75.1% | 13.2% | OpenAI | 2025-07-01
8 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Flash | 73.6% | 12.6% | Google DeepMind | 2025-07-01
9 | Vanilla RAG (text-embedding-3-small) + DeepSeek V3.1 | 63.1% | 16.1% | DeepSeek | 2025-07-01
10 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Flash-Lite | 63.1% | 16.8% | Google DeepMind | 2025-07-01
11 | Vanilla RAG (text-embedding-3-small) + GPT-4o | 50.9% | 32.6% | OpenAI | 2025-07-01

About This Leaderboard

This leaderboard shows pure language model comprehension performance when given the correct context. It excludes cases where the retriever failed to fetch relevant information, isolating the LLM's ability to understand and extract information from financial documents.
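
One way to reproduce this view is to score only the questions where retrieval succeeded, so failures of the retrieval step do not count against the model. The sketch below is an assumption about how such a filter might look; the retrieval_hit flag is a hypothetical field name.

```python
# Sketch of the "pure LLM" view described above: keep only the cases
# where the retriever returned the relevant (golden) context. The
# `retrieval_hit` flag is an assumed field name for illustration.
def pure_llm_results(results: list[dict]) -> list[dict]:
    return [r for r in results if r.get("retrieval_hit")]

# The overall leaderboard scores all results; the pure-LLM leaderboard
# scores only the retrieval hits.
```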

Submit Your Results

Want to add your model to the leaderboard?

To submit results, email your paper (and optionally your code) to the authors at siegfried.handschuh@unisg.ch.