A Benchmark for Financial Document Comprehension
FinancialTouchstone is a question answering dataset designed to evaluate AI models on their ability to extract and comprehend information from real-world financial documents, including annual reports, earnings releases, and regulatory filings.
Dataset v1.2
Questions are derived from actual annual reports and financial filings from publicly traded companies across multiple industries.
All questions and answers are carefully curated and verified by financial domain experts to ensure accuracy and relevance.
Covers key financials, cash flow analysis, revenue breakdown, segment reporting, and company classification questions.
Get the complete FinancialTouchstone dataset in Excel format.
Download Excel (v1.2): 2,788 questions with golden answers and context
Get the FinancialTouchstone dataset in JSON format.
Download JSON (v1.2): Same data in JSON format with train/dev/test splits
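As a rough illustration, the JSON release could be loaded as in the sketch below. The file name and the split keys (`train`, `dev`, `test`) are assumptions about the layout, not a documented schema.

```python
import json

# Minimal sketch, assuming the release is a single JSON file keyed by split.
# File name and split keys are assumptions -- adjust to the actual download.
with open("financial_touchstone_v1.2.json", encoding="utf-8") as f:
    data = json.load(f)

for split in ("train", "dev", "test"):
    records = data.get(split, [])
    print(f"{split}: {len(records)} questions")
```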
Official prompts for querying models and LLM-based answer grading.
Download Prompts: 6 question types + 3 grading modes (JSON)
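A minimal sketch of using the prompt file is given below. The top-level keys (`query_prompts`, `grading_prompts`) and the template placeholders are assumptions about its structure; the actual field names may differ.

```python
import json

# Hypothetical structure: one query template per question type and one
# template per grading mode, each with named placeholders.
with open("financial_touchstone_prompts.json", encoding="utf-8") as f:
    prompts = json.load(f)

def build_query(question_type: str, question: str, context: str) -> str:
    """Fill the query template for one of the six question types."""
    template = prompts["query_prompts"][question_type]
    return template.format(question=question, context=context)

def build_grading(mode: str, question: str, golden: str, prediction: str) -> str:
    """Fill one of the three LLM-based grading templates."""
    template = prompts["grading_prompts"][mode]
    return template.format(question=question, golden_answer=golden, prediction=prediction)
```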
The dataset contains 2,788 expert-verified questions derived from 470 annual reports. Only records with complete golden answers and supporting context are included. Records with missing or incomplete data are excluded from the public release to ensure evaluation quality.
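For orientation, a record might look like the hypothetical sketch below; only the completeness guarantee (golden answer plus supporting context) comes from the dataset description, while all field names and values here are illustrative assumptions.

```python
# Hypothetical record layout -- field names are assumptions, not the
# published schema. Placeholder strings stand in for real content.
record = {
    "question_id": "FT-000123",           # assumed identifier format
    "question_type": "cash_flow_analysis",
    "question": "What was free cash flow in fiscal year 2023?",
    "golden_answer": "...",               # expert-verified answer
    "context": "...",                     # supporting passage from the filing
    "source_document": "...",             # one of the 470 annual reports
}

def is_complete(rec: dict) -> bool:
    """Mirrors the release criterion: golden answer and context must be present."""
    return bool(rec.get("golden_answer")) and bool(rec.get("context"))
```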
Model performance on the FinancialTouchstone test set (with retriever errors)
| Rank | Model | Accuracy | Hallucination | Organization | Submitted |
|---|---|---|---|---|---|
| Baseline | Human Expert | 84.8% | 2.8% | DS-NLP, University of St. Gallen | 2025-01-15 |
| 1 | GraphRAG (text-embedding-3-small) + Gemini 2.5 Pro | 68.3% | 9.3% | DS-NLP, University of St. Gallen | 2025-07-01 |
| 2 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Pro | 66.2% | 3.8% | Google DeepMind | 2025-07-01 |
| 3 | Vanilla RAG (text-embedding-3-small) + o3 | 64.6% | 5.9% | OpenAI | 2025-07-01 |
| 4 | Vanilla RAG (text-embedding-3-small) + Claude Sonnet 4 | 64.6% | 6.6% | Anthropic | 2025-07-01 |
| 5 | Vanilla RAG (text-embedding-3-small) + Grok 4 | 59.1% | 10.5% | xAI | 2025-07-01 |
| 6 | Vanilla RAG (text-embedding-3-small) + DeepSeek R1 | 58.1% | 11.8% | DeepSeek | 2025-07-01 |
| 7 | Vanilla RAG (text-embedding-3-small) + Claude Opus 4 | 57.4% | 7.3% | Anthropic | 2025-07-01 |
| 8 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Flash | 54.1% | 11.8% | Google DeepMind | 2025-07-01 |
| 9 | Vanilla RAG (text-embedding-3-small) + o4-mini | 50.2% | 13.8% | OpenAI | 2025-07-01 |
| 10 | Vanilla RAG (text-embedding-3-small) + DeepSeek V3.1 | 47.6% | 14.6% | DeepSeek | 2025-07-01 |
| 11 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Flash-Lite | 39.2% | 14.6% | Google DeepMind | 2025-07-01 |
| 12 | Vanilla RAG (text-embedding-3-small) + GPT-4o | 33.3% | 16.3% | OpenAI | 2025-07-01 |
This leaderboard shows overall system performance including retrieval errors. When the retriever fails to fetch relevant context, model accuracy drops significantly.
Model performance excluding retriever errors (correct context provided)
This leaderboard shows pure language model comprehension performance when given the correct context. It excludes cases where the retriever failed to fetch relevant information, isolating the LLM's ability to understand and extract information from financial documents.
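The two leaderboard views differ only in which questions are counted. A minimal sketch, assuming per-question result records with `correct` and `retriever_hit` flags (both names are assumptions, not part of the release):

```python
def accuracy(results: list[dict], exclude_retriever_errors: bool = False) -> float:
    """Fraction of correctly answered questions.

    With exclude_retriever_errors=True, questions where the retriever failed
    to return the relevant context are dropped, isolating LLM comprehension.
    """
    pool = [r for r in results if r["retriever_hit"]] if exclude_retriever_errors else results
    if not pool:
        return 0.0
    return sum(r["correct"] for r in pool) / len(pool)
```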
Want to add your model to the leaderboard?
To submit results, send your paper (and optionally your code) to the authors at siegfried.handschuh@unisg.ch.