Introducing emailbench: an open-source benchmark for email understanding

Share with your network!

April 01, 2026 David Biagioni and Adam Napora

Large language models (LLMs) are becoming foundational to email security products. Increasingly, they power classification, threat detection, triage, and analyst workflows. While hundreds of public benchmarks measure general reasoning and cybersecurity, none are designed to measure whether an LLM understands email communication and email security as first-class domains.

Emailbench fills that gap. It is a compact, high-quality benchmark for email understanding, built entirely from open sources and designed to be practical for model evaluation and iteration.

Why a dedicated email benchmark?

Email has its own ecosystem: protocols, headers, authentication standards, delivery paths, and attacker behaviors distinct from other cybersecurity domains. Models that perform well on broad benchmarks can still miss important email-specific concepts. This is especially true if you care about:

authentication and identity signals (SPF, DKIM, DMARC)
phishing and social engineering patterns
header and routing semantics (MTA/MUA, Received chains, envelope vs. header)
secure handling practices and trust cues

To address these gaps, emailbench focuses on foundational concepts that repeatedly appear in real-world email security.

What is emailbench?

Emailbench is a multiple-choice question-answer (MCQA) dataset consisting of 312 questions derived from the Wikipedia “Email” page and a curated set of directly linked pages (for example, SMTP, SPF/DKIM/DMARC, MIME, phishing, spoofing, and related fundamentals).

Each example includes:

a four-option multiple-choice question
a single ground-truth answer
metadata (for example, category, estimated difficulty, and source citations)

Here’s a representative example:

{

"question": "What is the proper SMTP envelope sender for a bounce message?",

"ground_truth_answer": "B",

"gold": [

"choices": [

"(A) The RFC 2822 From header of the original message",

"(B) An empty path: MAIL FROM:<>",

"(C) The postmaster address of the sending domain",

"(D) The Sender header of the original message"

"question_category": "factual",

"estimated_difficulty": 3,

"citations": [

"One special form of a path still exists: the empty path MAIL FROM:<>, used for many auto replies and especially all bounces."

"question_generating_model": "azure/gpt-5",

"document_summary": "A bounce message is an automated email notifying the sender that a previous message could not be delivered; it may arrive immediately or after retries, and is formally known as an NDR/NDN or DSN...

"wiki_title": "Bounce_message",

"wiki_url": "https://en.wikipedia.org/wiki/Bounce_message",

"index": 67

}

How we built it

The goal wasn’t trivia. It was to create questions that are:

accurate
clearly worded
closed-book answerable (generalizable, not niche edge cases)
email-security relevant
foundational

The process was as follows:

Scrape and curate the Wikipedia “Email” page and directly linked articles.
Generate MCQA items using a benchmark-generation workflow.
Semantic deduplication (embedding and clustering) to remove near-duplicates.
Apply an LLM-as-judge pass or fail filter against the criteria above.
Perform human review to remove ambiguous wording, leakage, and remaining duplicates.
Run sanity check using “LLM-as-student” evaluation: run strong models on the dataset to surface low-quality questions that slipped through.

This combination (automatic generation + multiple filtering layers + human review) produced a dataset that’s small enough to run often but curated enough to be useful.

NOTE: Human review focused on removing poorly worded or confusing questions, near-duplicates, and questions that clearly referenced the source text or other external data sources (not truly “closed-book”). As with any dataset derived from public sources, there might still be inaccuracies. If you find one, let us know.

What it’s for

Emailbench is intended as a practical tool for teams building email-focused AI systems:

Model selection: Choose models optimized for email understanding, not just general ability
Regression testing: Detect email performance drops after training, or after prompt or tooling changes
Domain adaptation measurement: Quantify whether continued pre- or post-training improves email understanding
Benchmark generation workflow: demonstrate a repeatable approach for producing domain benchmarks from documentation-like sources

Experiments and results

A) Comparing closed- and open-weight models

Goal: Measure how well popular closed-weight (frontier) and open-weight models handle email-specific concepts

Method: We integrated emailbench into the Hugging Face lighteval benchmarking harness as a generative task. We then ran it against various frontier and open-weight models using a simple prompt:

The following is a multiple choice question (with possible answers) about email communication and security. Your job is to select the correct answer.

Question: {{question}}

Choices:

A. {{choice_a}}

B. {{choice_b}}

C. {{choice_c}}

D. {{choice_d}}

Respond only with the letter of the correct answer, A, B, C, or D. No other text or commentary.

Answer:

Results: Extractive match accuracy is described in the sections that follow.

Frontier models: Large frontier models achieved 100% accuracy. This isn’t surprising, given that the source Wikipedia pages are included in their training data. Even smaller frontier models performed at 98% accuracy and above.

Open models:

1B - 2B Parameters: Qwen2.5-1.5B-Instruct doubled the performance of Llama-3.2-1B-Instruct, and even outperformed Llama-Primus-Merged, an 8B-parameter model fine-tuned for cybersecurity knowledge.
7B - 8B Parameters: Qwen2.5-7B-Instruct outperformed both Llama-3.1-8B-Instruct and cybersecurity-specific fine-tuned models. These include Foundation-Sec-8B and Llama-Primus-Merged, which were derived from the Llama-3.1-8B base model.

From this study, we can conclude that frontier models have maximum accuracy. The Qwen2.5 family of models performed best among the open-weight models of both size classes. They even outperformed models fine-tuned on general cybersecurity data.

B) Measuring the impact of continuous pre-training on a 1B parameter model

The second experiment measured how continuous pre-training (CPT) affects email understanding — both gains and regressions. We used Wikipedia source data for the benchmark because it’s highly relevant by definition. We also used a different open-source corpus related to email for the test and validation sets.

Base model: We wanted to quickly run CPT on a small model under 2B parameters. Looking at the evaluation results, the meta-llama/Llama-3.2-1B-Instruct model met that criterion and showed most room for improvement on emailbench.

Dataset: For training, we used the same subset of Wikipedia pages used to generate emailbench. This included just 17 documents, giving the model the most relevant documents for the benchmark. For the validation and test sets, we sampled trendmicro-ailab/Primus-FineWeb that had high relevance to email outside of Wikipedia. As a relevance score, we used cosine similarity with the Wikipedia corpus.

Training: We used a simple CPT setup with the Hugging Face trl library. We chose a small learning rate of 1e-5 and batch size of 8, optimizing the entropy loss of next token prediction. Validation loss was minimized after just one epoch of training. This was likely due to the very small size of the training set. To confirm, we then evaluated Perplexity on the held-out test set. The training and evaluation curves for each checkpoint are shown in the figure.

While the model overfits to the training data after just one epoch, its performance on emailbench continues to improve (see the following figure). This is an outcome of the benchmark being derived from the training set.

Another question we wanted to answer was how the CPT process affects model performance across other, seemingly unrelated tasks. To get a sense of this, we ran the checkpoints against a handful of public benchmarks using 200 random samples from each:

gsm8k: grade-school math
ifeval: instruction following
bigbench_hard:causal_reasoning: answer a causal question about a short story.

The following figure shows the performance across these benchmarks.

Interestingly, as emailbench performance improved with each training step, so did performance on both gsm8k and bigbench_hard:causal_judgment. This aligns with empirical evidence that fine-tuning sometimes unlocks related capabilities, even if they’re not part of the training data. In contrast, performance on obeying instructions–as measured by ifeval–showed a flat to slightly downward trend. We expect longer training cycles over more data to amplify these effects. But the results illustrate the types of improvements or regressions that coincide with any continued pre-training or fine-tuning exercise.

Open source availability and how to use

We’re releasing emailbench as an open dataset so the community can:

evaluate email-specific understanding consistently
compare training strategies
build better email-security AI systems

Emailbench is available on Hugging Face. Visit: https://huggingface.co/datasets/proofpoint-ai/emailbench.

The dataset card provides an example of how to plug emailbench into the Hugging Face lighteval harness for quick experimentation.

Conclusion

Email is one of the most important–and most attacked–communication channels. If we’re going to rely on LLMs to reason about email, we must be able to measure whether they truly understand it.

Emailbench is a first step: focused, practical, and open.

We hope you find emailbench useful in your own evaluations. Happy benchmarking!

Cybersecurity for the agentic workspace starts with Proofpoint’s human and agent-centric security platform.

Join a live Protect event—learn how to protect people, data, and AI

Stop cyberthreats with AI-driven multichannel protection.

Experience Core Email Protection in action—block 99.99% of email threats

Transform data security with a unified, omnichannel approach.

Understand the top data security risks organizations face — and how to stay ahead

Unify AI security across people, agents, and MCP

AI at Proofpoint

Proofpoint technologies powering human and agent-centric security.

AI at Proofpoint

Optimize Proofpoint solutions with expert services.

"The partnership with Proofpoint, it's an extention of our team." –Celesta Capital

Comprehensive solutions for today’s cybersecurity threats.

Learn about new AI risks—and how to build a secure foundation for enterprise adoption

Superior protection for every industry, from small business to large enterprise.

Discover the security risks healthcare organizations can't afford to ignore

More than 80 of the Fortune 100 choose Proofpoint to protect their people, data, and AI.

Evaluating security vendors? Compare us by checking out side-by-side comparisons.

Research, insights and resources from Proofpoint experts.

New Agents, New Attacks: Securing Collaboration in the Agentic Era

Learn from our expert threat intelligence and insights that you won’t find anywhere else.

Proofpoint DISCARDED Tales from the threat research trenches

Learn more about the team driving human and agent-centric security.

Ready to join a company redefining cybersecurity?