PREreview of Financial Named Entity Recognition: How Far Can LLM Go?
- Published
- DOI
- 10.5281/zenodo.19039536
- License
- CC BY 4.0
Summary
This paper presents a systematic evaluation of state-of-the-art LLMs on the financial Named Entity Recognition (NER) task, testing multiple models under three distinct prompting strategies: direct prompting, few-shot prompting, and chain-of-thought prompting. The authors evaluate performance using entity-level and weighted F1 scores, identify five representative failure types with concrete examples, and discuss implications for domain adaptation of general-purpose LLMs in financial information extraction contexts. The work is positioned as a diagnostic study aimed at clarifying the current capability ceiling of LLMs on this practically important but underexplored task.
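For reference, "entity-level F1" here means exact-match span scoring: a prediction counts only if both the span boundaries and the entity type match the gold annotation, which is stricter than token-level scoring. A minimal sketch (the data is illustrative, not drawn from the paper's benchmark):

```python
def entity_f1(gold, pred):
    """gold, pred: sets of (start, end, type) tuples for one corpus."""
    tp = len(gold & pred)                      # exact span + type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative example: one correct entity, one type error, one miss.
gold = {(0, 5, "ORG"), (10, 14, "PER"), (20, 24, "LOC")}
pred = {(0, 5, "ORG"), (10, 14, "ORG")}
print(round(entity_f1(gold, pred), 3))  # 0.4
```

A weighted F1 would additionally average per-type scores weighted by gold-entity counts; the exact weighting scheme used in the paper is not specified.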
Strengths
The focus on financial NER is well-motivated and practically relevant. Financial documents — earnings releases, regulatory filings, business news, analyst reports — contain dense entity references where extraction accuracy directly affects downstream applications including knowledge graph construction, risk monitoring, and automated document generation. Despite its importance, the behavior of modern LLMs on financial NER under varied prompting conditions has not been systematically characterized, and this paper fills that gap in a direct and useful way.
The taxonomy of five failure types is the paper's most practically valuable contribution. Concrete error categories — entity type confusion (e.g., company names labeled as persons), over-extraction of non-entity spans (e.g., "German car makers" labeled as ORG), contextual disambiguation failures (e.g., "Google Maps" labeled as ORG instead of recognized as a product), abbreviation blindness (e.g., "NYSE" unrecognized), and boundary errors — provide actionable diagnostic information for practitioners building financial NLP pipelines. These failure modes are not arbitrary: they reflect systematic patterns in how LLMs trained on general corpora fail to internalize the entity ontology conventions specific to financial text.
The three-prompt comparison — direct, few-shot, and chain-of-thought — is a sensible experimental design that covers the primary prompting paradigms available to practitioners who do not have labeled data for fine-tuning. The inclusion of chain-of-thought is particularly useful, as it has shown strong performance on reasoning tasks and its effectiveness on structured extraction tasks like NER is less well-understood.
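To make the three conditions concrete, the templates below paraphrase what each strategy typically looks like for NER; the wording is hypothetical, as the authors' exact prompts are not reproduced in this review.

```python
# Hypothetical prompt templates for the three strategies compared in the
# paper (direct, few-shot, chain-of-thought); not the authors' prompts.

DIRECT = (
    "Extract all named entities (PER, ORG, LOC) from the text below.\n"
    "Text: {text}\n"
    "Entities:"
)

FEW_SHOT = (
    "Extract all named entities (PER, ORG, LOC).\n"
    "Text: Apple shares rose after Tim Cook spoke in Cupertino.\n"
    "Entities: Apple (ORG), Tim Cook (PER), Cupertino (LOC)\n"
    "Text: {text}\n"
    "Entities:"
)

CHAIN_OF_THOUGHT = (
    "Extract all named entities (PER, ORG, LOC).\n"
    "First identify each candidate span and reason step by step about "
    "its type in context, then output the final entity list.\n"
    "Text: {text}\n"
    "Reasoning:"
)

print(DIRECT.format(text="NYSE halts trading in XYZ Corp."))
```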
Weaknesses and Limitations
The evaluation dataset and models are not described in sufficient detail for reproducibility. Financial NER benchmarks vary substantially in their entity ontology (some use PER/ORG/LOC/MISC, others use financial-specific types such as TICKER, PRODUCT, REGULATION, MONETARY_VALUE), document genre (news, filings, transcripts), and annotation quality. Without knowing which benchmark was used, what entity types were targeted, and how the test split was constructed, it is impossible to assess whether the reported F1 scores reflect performance on a representative or artificially constrained sample of the financial NER problem.
The paper evaluates three prompting strategies but does not include a fine-tuned supervised baseline as a reference point. Without knowing the performance of a RoBERTa or DeBERTa model fine-tuned on the same benchmark, the absolute F1 scores for LLMs cannot be contextualized. In financial NER, fine-tuned smaller models frequently outperform much larger general-purpose LLMs, and the paper's framing — "how far can LLM go?" — requires this reference point to be answerable.
The failure type taxonomy, while useful, is presented without quantitative distribution data across models and prompting strategies. Knowing that "non-entity mislabeling" accounts for 40% of errors under direct prompting but only 15% under few-shot prompting would allow practitioners to make principled choices about which prompting strategy to use for their specific error tolerance. The absence of this breakdown reduces the taxonomy's actionability.
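The requested breakdown is cheap to produce once errors are labeled by type. A sketch, using hypothetical counts that mirror the example percentages above:

```python
# Sketch of a per-strategy failure-type breakdown from labeled errors.
# The numbers are hypothetical, chosen to mirror the example in the text.
from collections import Counter

errors = ([("direct", "non-entity mislabeling")] * 40
          + [("direct", "boundary error")] * 60
          + [("few-shot", "non-entity mislabeling")] * 15
          + [("few-shot", "boundary error")] * 85)

by_strategy = Counter(strategy for strategy, _ in errors)
by_pair = Counter(errors)
for (strategy, ftype), n in sorted(by_pair.items()):
    pct = 100 * n / by_strategy[strategy]
    print(f"{strategy:>8} | {ftype:<22} {pct:5.1f}%")
```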
The paper does not discuss the effect of financial domain-adapted LLMs such as FinBERT, BloombergGPT, or FinGPT relative to general-purpose models. Since the core question is how well LLMs handle financial NER, including at least one domain-specialized model would provide insight into whether the observed failures are addressable through domain pretraining or whether they reflect fundamental limitations of the generative NER paradigm.
The analysis of chain-of-thought prompting deserves more depth. For NER, CoT reasoning typically involves the model justifying its entity type assignment before producing the final label. Whether this reasoning is accurate when the model produces correct labels, and systematically flawed in a predictable way when it makes errors, is a substantive question with implications for self-consistency checking and verification approaches in production pipelines.
Suggestions
The paper should include a full data statement describing the benchmark, entity types, document genres, and train/test split used in the evaluation. If proprietary data was used, the authors should provide sufficient statistics for readers to assess scope and representativeness.
A confusion matrix across entity types for each model and prompting strategy would significantly increase the paper's utility. Financial NER errors are not uniformly distributed: certain entity pairs (ORG vs. PERSON for eponymous companies, TICKER vs. ORG for stock references) account for a disproportionate share of mistakes, and visualizing this would directly inform downstream pipeline design.
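Concretely, such a matrix can be built from aligned gold/predicted type pairs, with an "O" row and column for missed and spurious spans; off-diagonal cells expose systematic confusions such as ORG predicted as PER. A minimal sketch with hypothetical data:

```python
# Sketch of a per-type confusion matrix over aligned spans
# (hypothetical pairs; "O" marks a missed or spurious span).
from collections import Counter

TYPES = ["PER", "ORG", "LOC", "O"]

def confusion(pairs):
    """pairs: iterable of (gold_type, predicted_type) for aligned spans."""
    counts = Counter(pairs)
    header = "gold\\pred " + " ".join(f"{t:>5}" for t in TYPES)
    rows = [header]
    for g in TYPES:
        cells = " ".join(f"{counts[(g, p)]:>5}" for p in TYPES)
        rows.append(f"{g:>9} {cells}")
    return "\n".join(rows)

pairs = [("ORG", "ORG"), ("ORG", "PER"), ("PER", "PER"),
         ("LOC", "LOC"), ("ORG", "O"), ("O", "ORG")]
print(confusion(pairs))
```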
Given the paper's focus on failure modes, a section on mitigation strategies — domain-specific few-shot example selection, entity type definition injection, self-verification prompting — with preliminary results would substantially increase its practical impact. The failure taxonomy is valuable but becomes significantly more useful when paired with at least preliminary evidence of what interventions reduce each failure type.
Overall Assessment
This is a useful diagnostic study that addresses a practically important question: whether state-of-the-art LLMs can reliably perform financial NER under standard prompting conditions. The failure taxonomy is the paper's most actionable contribution. The main weaknesses are insufficient dataset documentation, the absence of fine-tuned supervised baselines for contextualization, and the lack of quantitative breakdown of failure types across models and prompting strategies. These are correctable gaps. As a workshop paper, it provides a solid empirical foundation that future work can build on. Recommended for acceptance with revisions to improve reproducibility and benchmark contextualization.
Competing interests
The author declares that they have no competing interests.
Use of Artificial Intelligence (AI)
The author declares that they did not use generative AI to come up with new ideas for their review.