FinancialTouchstone

A Benchmark for Financial Document Comprehension

FinancialTouchstone is a question answering dataset designed to evaluate AI models on their ability to extract and comprehend information from real-world financial documents, including annual reports, earnings releases, and regulatory filings.

2,788 Questions
470 Documents
6 Question Types

Dataset v1.2

What is FinancialTouchstone?

Real Financial Documents

Questions are derived from actual annual reports and financial filings from publicly traded companies across multiple industries.

Expert Annotations

All questions and answers are carefully curated and verified by financial domain experts to ensure accuracy and relevance.

Comprehensive Evaluation

Covers key financials, cash flow analysis, revenue breakdown, segment reporting, and company classification questions.

Tasks

The benchmark evaluates models on six question types derived from the questions most frequently asked by professional equity analysts.

1

Key Financials

Extract key financial metrics such as net income, EBITDA, earnings per share, profit margins, dividends, and other industry-specific KPIs from the annual report.

2

Cash Flow

Extract cash flow figures including operating, investing, and financing activities, as well as free cash flow and total net change in cash.

3

Revenue

Find and extract the company's revenue figures, accounting for industry-specific terminology (e.g., "premiums earned" in insurance, "net interest income" in banking).

4

Revenue Growth

Analyze year-over-year revenue growth, including reported growth rates or current and prior year revenues for comparison.

5

Business Segments

Identify and describe the company's business segments or divisions, including segment-specific revenues, growth rates, and operational details.

6

Company Type / Legal Form

Determine the legal form and corporate structure of the company (e.g., Inc., LLC, GmbH, AG, plc) based on the provided context.

Task Requirements

Models must answer questions based only on the provided context from annual reports. Key requirements include:

  • Ground all claims in the provided text; no external knowledge
  • Do not invent, calculate, or estimate figures not explicitly stated
  • If the context lacks the answer, state that the information is not available
  • Quote directly or cite information accurately where possible
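These requirements can be encoded directly in the prompt given to the model under evaluation. The sketch below is illustrative only; the official query prompts are distributed in the Evaluation Prompts download, and the wording here is a paraphrase of the rules above, not the official text.

```python
# Illustrative sketch: turning the task requirements into a chat-style
# request. The prompt wording is a paraphrase, not the official prompt.

SYSTEM_PROMPT = """You are answering questions about an annual report.
Rules:
- Ground every claim in the provided context; use no external knowledge.
- Do not invent, calculate, or estimate figures not explicitly stated.
- If the context lacks the answer, state that the information is not available.
- Quote directly or cite the context where possible."""

def build_messages(question: str, context: str) -> list[dict]:
    """Assemble a chat-style request from a question and retrieved context."""
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]

msgs = build_messages(
    "What information is provided about the company's cash flow?",
    "Operating cash flow (in CHF): 13.9 billion ...",
)
```

The context is placed before the question so the model reads the grounding material first; any chat-completion API that accepts system/user messages can consume this structure.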

Example: Cash Flow Question

Here's a complete example showing how a question is answered using context from an annual report:

1 Source Document

Company: Nestle S.A.

Report: Annual Report 2021

Country: Switzerland (CH)

Industry: Consumer Staples

2 Question

"What information is provided about the company's cash flow?"

3 Retrieved Context (from PDF page 3)

Operating cash flow (in CHF): 13.9 billion (42.1% of net financial debt)
Free cash flow* (in CHF): 8.7 billion

4 Golden Answer

13.9bn CHF operating cash flow, 8.7bn CHF free cash flow

Data Schema

Each entry in the dataset follows this JSON structure (available in Excel format as well):

{
  "unique_key": "ID_000001_cash_flow",
  "report_id": "ID_000001",
  "company_name": "Nestle",
  "year": 2021,
  "country": "CH",
  "industry": "Consumer Staples",
  "question_type": "cash_flow",
  "golden_answer": "13.9bn CHF operating cash flow, 8.7bn CHF free cash flow",
  "golden_context": "page 3: Operating cash flow (in CHF)..."
}
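A minimal loading sketch, assuming the JSON release is an array of such entry objects: parse the records and group them by `question_type`. The field names follow the published schema; the sample record is the one shown above.

```python
import json
from collections import defaultdict

# Sample record from the Data Schema section, embedded as a JSON string
# so the sketch is self-contained (the real release is a file of such
# entries, assumed here to be a top-level JSON array).
SAMPLE = """[
  {"unique_key": "ID_000001_cash_flow", "report_id": "ID_000001",
   "company_name": "Nestle", "year": 2021, "country": "CH",
   "industry": "Consumer Staples", "question_type": "cash_flow",
   "golden_answer": "13.9bn CHF operating cash flow, 8.7bn CHF free cash flow",
   "golden_context": "page 3: Operating cash flow (in CHF)..."}
]"""

def group_by_type(raw_json: str) -> dict[str, list[dict]]:
    """Parse dataset entries and bucket them by their question_type field."""
    entries = json.loads(raw_json)
    by_type: dict[str, list[dict]] = defaultdict(list)
    for e in entries:
        by_type[e["question_type"]].append(e)
    return dict(by_type)

grouped = group_by_type(SAMPLE)
```

Grouping by `question_type` makes it easy to report per-task scores alongside the overall accuracy, mirroring the six task categories above.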

Evaluation

We use an LLM-as-a-Judge methodology for automated, scalable evaluation

LLM-as-a-Judge Approach

Model responses are evaluated using automated LLM-based grading (GPT-5 and o3). The evaluation prompts were iteratively refined until no systematic errors remained, with an expected inherent error rate of approximately 2% due to edge cases.

Accuracy (Recall)

Measures whether the golden answer facts were successfully mentioned in the model's response. An answer is correct if every verifiable claim is factually accurate according to the retrieved context and addresses the core requirement.

Hallucination Rate (Inverse Precision)

A hallucination is any factual claim in the model's answer that cannot be verified using the retrieved context. Extra information that IS present in the context is NOT a hallucination. Lower is better.

Retriever Insufficiency

Tracks whether errors stem from the retrieval system failing to provide adequate context. When retrieval fails, model accuracy drops to 0.2%, demonstrating that retrieval is the primary bottleneck.

Evaluation Procedure

1

Hallucination Verification

Check each factual claim in the model's answer against the Retrieved Context. The Retrieved Context is the single source of truth.

2

Correctness Assessment

Check for factual errors, then verify relevance and completeness against the golden answer.

3

Retriever Assessment

If incorrect, determine whether the retriever was insufficient (missing necessary information from the context).
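The control flow of step 1 can be sketched with a toy verifier that flags a claim as unsupported when its key figures do not appear in the retrieved context. This is deliberately simplistic: the real benchmark uses an LLM judge, not string matching, and the helper names here are hypothetical.

```python
import re

# Toy stand-in for hallucination verification (step 1): a claim counts as
# supported only if every numeric figure it cites appears in the retrieved
# context. The actual benchmark delegates this judgment to an LLM judge.

def extract_figures(text: str) -> set[str]:
    """Pull numeric tokens (e.g. '13.9', '42.1%') out of a piece of text."""
    return set(re.findall(r"\d+(?:\.\d+)?%?", text))

def unsupported_claims(claims: list[str], context: str) -> list[str]:
    """Return the claims whose figures are not all present in the context."""
    ctx_figures = extract_figures(context)
    return [c for c in claims if not extract_figures(c) <= ctx_figures]

context = "Operating cash flow (in CHF): 13.9 billion. Free cash flow: 8.7 billion."
claims = [
    "Operating cash flow was 13.9 billion CHF",
    "Net income was 16.9 billion CHF",  # figure absent from the context
]
bad = unsupported_claims(claims, context)  # flags only the second claim
```

An LLM judge generalizes this beyond numbers to paraphrases and qualitative claims, but the decision structure is the same: the Retrieved Context is the single source of truth.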

Grading Modes

The evaluation framework supports three grading modes:

  • Default: Standard evaluation against the gold standard; recommended for benchmark evaluation
  • Strict: Rigorous evaluation with high accuracy requirements; penalizes numerical discrepancies and missing elements
  • Lenient: Flexible evaluation focusing on core concepts; accepts reasonable approximations and gives credit for partial understanding
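Selecting among the three modes amounts to choosing a grading prompt. A minimal sketch, with placeholder prompt texts (the official ones ship in the Evaluation Prompts download):

```python
# Hypothetical mode-to-prompt lookup; the prompt strings are placeholders
# summarizing each mode, not the official grading prompts.
GRADING_PROMPTS = {
    "default": "Grade the answer against the gold standard.",
    "strict": "Penalize any numerical discrepancy or missing element.",
    "lenient": "Accept reasonable approximations of the core concepts.",
}

def grading_prompt(mode: str = "default") -> str:
    """Return the grading prompt for a mode; reject unknown modes early."""
    if mode not in GRADING_PROMPTS:
        raise ValueError(f"unknown grading mode: {mode}")
    return GRADING_PROMPTS[mode]
```

Failing fast on an unknown mode avoids silently grading a whole run with the wrong rubric.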

Getting Started

Download Dataset (Excel)

Get the complete FinancialTouchstone dataset in Excel format.

Download Excel (v1.2)

2,788 questions with golden answers and context

Download Dataset (JSON)

Get the FinancialTouchstone dataset in JSON format.

Download JSON (v1.2)

Same data in JSON format with train/dev/test splits

Evaluation Prompts

Official prompts for querying models and LLM-based answer grading.

Download Prompts

6 question types + 3 grading modes (JSON)

Annual Report PDFs

Access the original 470 annual reports used in the benchmark.

Open Google Drive

PDF + OCR text for all 470 documents

What's Included in v1.2

The dataset contains 2,788 expert-verified questions derived from 470 annual reports. Only records with complete golden answers and supporting context are included. Records with missing or incomplete data are excluded from the public release to ensure evaluation quality.

  • Included: Questions with both golden_answer AND golden_context populated
  • Excluded: 770 incomplete records (missing context or answer)
  • Quality: 115 errors identified and corrected (4.1% error rate on populated data)

Leaderboard

Model performance on the FinancialTouchstone test set (including retriever errors)

| Rank | Model | Acc. | Hall. | Organization | Submitted |
|---|---|---|---|---|---|
| Baseline | Human Expert | 84.8% | 2.8% | DS-NLP, University of St. Gallen | 2025-01-15 |
| 1 | GraphRAG (text-embedding-3-small) + Gemini 2.5 Pro | 68.3% | 9.3% | DS-NLP, University of St. Gallen | 2025-07-01 |
| 2 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Pro | 66.2% | 3.8% | Google DeepMind | 2025-07-01 |
| 3 | Vanilla RAG (text-embedding-3-small) + o3 | 64.6% | 5.9% | OpenAI | 2025-07-01 |
| 4 | Vanilla RAG (text-embedding-3-small) + Claude Sonnet 4 | 64.6% | 6.6% | Anthropic | 2025-07-01 |
| 5 | Vanilla RAG (text-embedding-3-small) + Grok 4 | 59.1% | 10.5% | xAI | 2025-07-01 |
| 6 | Vanilla RAG (text-embedding-3-small) + DeepSeek R1 | 58.1% | 11.8% | DeepSeek | 2025-07-01 |
| 7 | Vanilla RAG (text-embedding-3-small) + Claude Opus 4 | 57.4% | 7.3% | Anthropic | 2025-07-01 |
| 8 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Flash | 54.1% | 11.8% | Google DeepMind | 2025-07-01 |
| 9 | Vanilla RAG (text-embedding-3-small) + o4-mini | 50.2% | 13.8% | OpenAI | 2025-07-01 |
| 10 | Vanilla RAG (text-embedding-3-small) + DeepSeek V3.1 | 47.6% | 14.6% | DeepSeek | 2025-07-01 |
| 11 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Flash-Lite | 39.2% | 14.6% | Google DeepMind | 2025-07-01 |
| 12 | Vanilla RAG (text-embedding-3-small) + GPT-4o | 33.3% | 16.3% | OpenAI | 2025-07-01 |

Metrics

  • Accuracy (Acc.): Percentage of questions answered correctly based on the retrieved context.
  • Hallucination Rate (Hall.): Percentage of answers containing information not supported by the source documents (lower is better).
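Given per-question judge verdicts, the two leaderboard metrics reduce to simple averages. A sketch, assuming hypothetical verdict field names (`correct`, `hallucinated`) rather than the official grading-output schema:

```python
# Aggregate per-question judge verdicts into the two leaderboard metrics.
# The verdict dict fields are hypothetical names for illustration.

def score(verdicts: list[dict]) -> tuple[float, float]:
    """Return (accuracy, hallucination_rate) over a list of verdicts."""
    n = len(verdicts)
    accuracy = sum(v["correct"] for v in verdicts) / n
    hallucination_rate = sum(v["hallucinated"] for v in verdicts) / n
    return accuracy, hallucination_rate

verdicts = [
    {"correct": True,  "hallucinated": False},
    {"correct": True,  "hallucinated": False},
    {"correct": False, "hallucinated": True},
    {"correct": True,  "hallucinated": False},
]
acc, hall = score(verdicts)  # acc = 0.75, hall = 0.25
```

Note that the two metrics are independent: an answer can be incorrect without hallucinating (e.g. incomplete), and in principle correct on its core claim while still adding an unverifiable one.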

This leaderboard shows overall system performance including retrieval errors. When the retriever fails to fetch relevant context, model accuracy drops significantly.

Pure LLM Performance

Model performance excluding retriever errors (correct context provided)

| Rank | Model | Acc. | Hall. | Organization | Submitted |
|---|---|---|---|---|---|
| Baseline | Human Expert | 84.8% | 2.8% | DS-NLP, University of St. Gallen | 2025-01-15 |
| 1 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Pro | 91.6% | 3.2% | Google DeepMind | 2025-07-01 |
| 2 | Vanilla RAG (text-embedding-3-small) + o3 | 89.2% | 6.9% | OpenAI | 2025-07-01 |
| 3 | Vanilla RAG (text-embedding-3-small) + Claude Sonnet 4 | 89.0% | 7.2% | Anthropic | 2025-07-01 |
| 4 | Vanilla RAG (text-embedding-3-small) + Claude Opus 4 | 81.7% | 7.5% | Anthropic | 2025-07-01 |
| 5 | Vanilla RAG (text-embedding-3-small) + Grok 4 | 77.4% | 11.4% | xAI | 2025-07-01 |
| 6 | Vanilla RAG (text-embedding-3-small) + DeepSeek R1 | 75.6% | 13.0% | DeepSeek | 2025-07-01 |
| 7 | Vanilla RAG (text-embedding-3-small) + o4-mini | 75.1% | 13.2% | OpenAI | 2025-07-01 |
| 8 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Flash | 73.6% | 12.6% | Google DeepMind | 2025-07-01 |
| 9 | Vanilla RAG (text-embedding-3-small) + DeepSeek V3.1 | 63.1% | 16.1% | DeepSeek | 2025-07-01 |
| 10 | Vanilla RAG (text-embedding-3-small) + Gemini 2.5 Flash-Lite | 63.1% | 16.8% | Google DeepMind | 2025-07-01 |
| 11 | Vanilla RAG (text-embedding-3-small) + GPT-4o | 50.9% | 32.6% | OpenAI | 2025-07-01 |

About This Leaderboard

This leaderboard shows pure language model comprehension performance when given the correct context. It excludes cases where the retriever failed to fetch relevant information, isolating the LLM's ability to understand and extract information from financial documents.

Submit Your Results

Want to add your model to the leaderboard?

Results can be submitted by emailing the paper, and optionally the code, to the authors at siegfried.handschuh@unisg.ch.

Organizers

FinancialTouchstone is developed and maintained by the DS-NLP team at the University of St. Gallen

Prof. Dr. Siegfried Handschuh

Chair of Data Science & NLP

Institute of Computer Science, University of St. Gallen

siegfried.handschuh@unisg.ch

Michael Gaus

Researcher

Institute of Computer Science, University of St. Gallen

michaelmarkus.gaus@unisg.ch

Jan Spörer

PhD Researcher

Institute of Computer Science, University of St. Gallen

jan.spoerer@unisg.ch

Frequently Asked Questions

What makes FinancialTouchstone different from other financial QA benchmarks?

FinancialTouchstone is the largest open benchmark for financial document comprehension, containing 2,788 question-answer pairs from 470 annual reports across 22 countries. Unlike prior benchmarks that focused primarily on US reports, our dataset covers global markets and diverse regulatory environments. The documents contain over 83 million tokens of high-quality, manually written financial text.

How is the dataset structured?

Each entry contains a question, the golden answer, the supporting context from the annual report, and metadata about the source document. The dataset includes six question types covering key financials, cash flow, revenue, revenue growth, business segments, and company legal form.

Why are there two leaderboards?

The first leaderboard shows overall system performance including retrieval errors; this represents real-world RAG pipeline performance. The second leaderboard shows pure LLM performance when given the correct context, isolating the model's comprehension ability from retrieval quality. Our research shows that 66.5% of errors stem from retrieval failures, not model limitations.

How is evaluation performed?

We use an LLM-as-a-Judge approach with automated evaluation using frontier models (GPT-5 and o3). The evaluation checks for hallucinations (claims not verifiable from context) and correctness (whether the answer addresses the golden answer requirements). The evaluation prompts were iteratively refined to minimize systematic errors.

What is the human baseline performance?

Human annotators achieved 84.8% accuracy with a 2.8% hallucination rate. Interestingly, some top models (like Gemini 2.5 Pro at 91.6% accuracy) exceed human accuracy, though no model yet matches the human hallucination rate.

Can I use this dataset for commercial purposes?

The dataset is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0), allowing use for research and commercial purposes with proper attribution.

How can I submit my results?

Send your paper and optionally your code to siegfried.handschuh@unisg.ch. We will evaluate your submission and add qualifying results to the leaderboard.

Are the annual report PDFs available?

Yes, all 470 annual reports are available in both PDF and OCR'd text format via Google Drive. The documents range from 50 to over 1,700 pages.

Citation

If you use FinancialTouchstone in your research, please cite our paper:

BibTeX

@inbook{10.1145/3768292.3770417,
  author = {Sp\"{o}rer, Jan},
  title = {Can AI Read Like a Financial Analyst? A Financial Touchstone for Frontier Language Models Such as Gemini 2.5 Pro, o3, and Grok 4 on Long-Context Annual Report Comprehension},
  year = {2025},
  isbn = {9798400722202},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3768292.3770417},
  booktitle = {Proceedings of the 6th ACM International Conference on AI in Finance},
  pages = {291--298},
  numpages = {8}
}

Paper: Can AI Read Like a Financial Analyst? A Financial Touchstone for Frontier Language Models Such as Gemini 2.5 Pro, o3, and Grok 4 on Long-Context Annual Report Comprehension

Venue: 6th ACM International Conference on AI in Finance (ICAIF '25), November 2025, Singapore

DOI: 10.1145/3768292.3770417