PREreview of AI-Based Clinical Decision Support Systems for Secondary Caries on Bitewings: A Multi-Algorithm Comparison

Published
DOI: 10.5281/zenodo.20173918
License: CC0 1.0

Paper: Chaves et al., "AI-Based Clinical Decision Support Systems for Secondary Caries on Bitewings: A Multi-Algorithm Comparison"

Preprint: medRxiv 26350883 · doi: 10.64898/2026.04.17.26350883v1

Reviewer: Paritosh Katyal — Director of Product, Bola AI

ORCID: 0009-0008-1549-7706

---

## Reviewer's note

This is an independent benchmark of eight dental AI tools on a shared set of bitewing radiographs. The design is solid: STARD-AI reporting, OSF preregistration, and a five-expert reference standard (two examiners plus a four-person review board with one overlap). The headline finding is that all the evaluated systems miss real lesions at a consistent rate. The paper calls this a "conservative diagnostic profile." I'd push on that language. Three things I'd want to add to the discussion:

- What happens to clinicians over time when AI consistently misses a class of lesion.

- Whether "conservative" is the right word for what the data actually shows.

- Whether a shared public benchmark might extend the open-data direction the paper opens.

Recommendation: major revisions. Reviewer disclosures appear at the end of this review.

---

## Summary of findings and contribution

The paper benchmarks eight systems on caries detection in bitewing radiographs:

- Commercial: Second Opinion®, CranioCatch, Diagnocat, DIO Inteligência, Align™ X-ray Insights.

- Experimental: two Mask R-CNN variants (with Swin Transformer backbone), one Mask DINO.

- Test set: 200 bitewings, 885 restored tooth surfaces, 5-expert consensus reference standard (two examiners plus a four-person review board with one overlap).

Four findings stand out:

- All systems hit high specificity (0.957–0.986) but moderate sensitivity (0.327–0.487).

- Commercial and experimental models showed no statistically significant performance differences.

- Proximal surfaces showed significantly lower sensitivity than occlusal (OR = 0.30, 95% CI 0.20–0.45); the sketch after this list shows how figures like these are computed. This matters clinically because secondary caries usually develops at proximal interfaces.

- Misclassifications were predominantly downward: the systems missed real lesions rather than overcalling them.
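
These sensitivity, specificity, and odds-ratio figures all reduce to 2×2 confusion-matrix arithmetic. A minimal sketch in Python, using invented counts chosen only so the odds ratio lands near the paper's 0.30; none of these numbers are the paper's underlying data:

```python
import math

# Invented per-surface counts for one hypothetical system. Chosen so the
# odds ratio comes out near 0.30; they are NOT taken from the paper.
occ_hit, occ_miss = 40, 42      # occlusal lesions detected / missed
prox_hit, prox_miss = 28, 98    # proximal lesions detected / missed
tn, fp = 700, 20                # sound surfaces correctly cleared / overcalled

# Sensitivity: share of real lesions caught.
# Specificity: share of sound surfaces correctly left alone.
sensitivity = (occ_hit + prox_hit) / (occ_hit + occ_miss + prox_hit + prox_miss)
specificity = tn / (tn + fp)

# Odds ratio for lesion detection on proximal vs occlusal surfaces.
odds_ratio = (prox_hit / prox_miss) / (occ_hit / occ_miss)

# Wald 95% CI on the log scale, the standard construction for a 2x2 OR.
# The interval differs from the paper's because these counts are invented.
se = math.sqrt(1/prox_hit + 1/prox_miss + 1/occ_hit + 1/occ_miss)
lo = math.exp(math.log(odds_ratio) - 1.96 * se)
hi = math.exp(math.log(odds_ratio) + 1.96 * se)

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
print(f"OR={odds_ratio:.2f} (95% CI {lo:.2f}-{hi:.2f})")  # OR < 1: proximal lesions less often caught
```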

This is a needed independent benchmark. Vendor-published numbers aren't directly comparable, and comparisons like this are how the field gets calibrated. The convergence between commercial and experimental models is itself interesting; it suggests further gains will come from dataset quality and labeling work, not from new architectures.

---

## Major issues

### 1. The "adjunct" framing assumes a clinician capacity the cross-sectional design cannot verify

The paper measures these AI tools at a single point in time, against a reference standard. It also calls them adjuncts to clinician judgment. That framing assumes the clinician keeps their own diagnostic capacity intact. But capacity is shaped by what you practice.

Dentists training alongside these CDSSs could, over time, develop reliance patterns that fit the AI's specific failure mode. The systems significantly underdetect on proximal surfaces (OR = 0.30 for sensitivity). Use them long enough and you may end up with clinicians who are reliably good at identifying non-carious surfaces, and progressively less able to catch, on their own, the lesions the AI misses. A system that consistently misses a class of lesion could, over time, shape attention away from that class.

A note from voice AI deployment. With voice AI for clinical documentation, the bet is that the tool lets clinicians be more present: they don't spend cognitive effort on note-taking, so their judgment in the moment is preserved. With radiographic AI doing the detection, the bet is different. The AI is shaping what the clinician sees. What that does longitudinally is the question this cross-sectional design can't answer. I'm raising it from voice AI experience, not from radiographic AI data; there's no published study I can point to on this specific mechanism, so this lands as an open question, not a finding.

### 2. "Conservative diagnostic profile" softens what the data shows

The paper calls the high-specificity, low-sensitivity pattern a "conservative diagnostic profile, favoring specificity over sensitivity," and frames this as clinically valuable for "minimizing unnecessary invasive treatments."

I'd push on the word "conservative." It reads like a virtue. What the data shows is that across the five commercial systems, dentin lesion detection ranged from 27 to 51 out of 103 (Second Opinion 51, Diagnocat 41, DIO 34, CranioCatch 28, Align 27). That's the systems missing real disease, not making a careful choice. Both readings are defensible (conservative threshold choice and underdetection), but the paper only engages the first. Saying plainly that the systems underdetect lesions is closer to what the numbers say.
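
Converting those counts to per-system sensitivities makes the point plainly. A trivial sketch using only the numbers quoted above:

```python
# Per-system dentin-lesion sensitivity, from the counts quoted above
# (lesions detected out of 103 dentin lesions in the test set).
dentin_hits = {
    "Second Opinion": 51,
    "Diagnocat": 41,
    "DIO": 34,
    "CranioCatch": 28,
    "Align": 27,
}
TOTAL = 103

for system, hits in dentin_hits.items():
    print(f"{system}: {hits}/{TOTAL} = {hits / TOTAL:.0%} of dentin lesions detected")
```

The best system catches about half of the dentin lesions; the weakest, about a quarter. "Underdetection" describes that more plainly than "conservative."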

One more thing on trust. False negatives and false positives don't carry the same weight with clinicians. From voice AI deployment, what I've seen is that clinicians tolerate occasional misrecognition but reject systems that drop content they actually said. Whether the same asymmetry shows up in radiographic AI, I can't cite. I'm offering this as a cross-domain pattern, not radiographic AI evidence. Goddard, Roudsari & Wyatt (2012, JAMIA 19(1):121–7) covers the general automation-bias mechanism.

### 3. Extending the open-data direction: would a shared public benchmark help?

The paper makes the case for standardized benchmarking and open datasets in dental AI. A natural next step worth raising: a shared public benchmark extending that open-data work. Something vendors could compete on, where the evaluation methodology and aggregate training-data composition are published.

Vendors aren't going to share what gives away their competitive edge. In voice AI deployment the same vendor-layer opacity exists, and my experience there is that a shared public benchmark shifts the incentive structure toward openness. The complication worth threading is that academic groups already collaborate directly with vendors, which shapes how any shared benchmark would actually work in practice. A sketch of what a benchmark entry might publish follows.
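
To make the suggestion concrete, here is a hypothetical sketch of what a single benchmark entry might publish. Every field name is my assumption, not anything the paper proposes:

```python
from dataclasses import dataclass

# Hypothetical schema for one entry in a shared public benchmark.
# All field names and values are illustrative, not from the paper.
@dataclass
class BenchmarkEntry:
    system: str                       # e.g. "Vendor X caries CDSS v2.1"
    eval_set_version: str             # frozen, versioned public test set
    training_cases: int               # aggregate count only, no images
    geographic_mix: dict[str, float]  # case-mix shares by region
    annotation_protocol: str          # public summary of labeling rules
    sensitivity: float                # measured on the frozen test set
    specificity: float

entry = BenchmarkEntry(
    system="Hypothetical CDSS",
    eval_set_version="bitewing-bench-v1",
    training_cases=250_000,
    geographic_mix={"EU": 0.6, "US": 0.3, "other": 0.1},
    annotation_protocol="two examiners plus an adjudication board",
    sensitivity=0.41,
    specificity=0.97,
)
```

The granularity is the point: vendors publish aggregates they can defend, the benchmark freezes the evaluation protocol, and nobody surrenders weights or images.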

---

## Minor issues

### Vendor pool framed as comprehensive but omits major US-market players

The evaluated set excludes Overjet, an FDA-cleared caries-detection vendor deployed across major DSO (dental support organization) networks such as Heartland, Aspen, and Pacific Dental. Overjet is also Pearl's direct US-market competitor. Without it, the comparison can't claim to cover commercial dental AI CDSSs comprehensively.

The Limitations section says "regional restrictions on software availability prevented inclusion of some commercial platforms," but doesn't say which vendors were considered or why each was excluded. Naming the excluded vendors and the reasons would help future researchers building on this benchmark, clinical readers using it for procurement, and reviewers citing this work.

### No reader study to test whether AI assistance improves dentist performance

The paper compares AI-alone outputs to a reference standard. The clinically relevant comparison is dentist alone vs AI alone vs dentist + AI. That three-arm design is established by Devlin et al. (2021, reference [12] in the present paper). The findings here tell us how the CDSSs perform alone, not whether they actually help dentists in practice.
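
If a revision adds that arm, the paired comparison has a standard analysis. A minimal sketch of McNemar's test on the dentist-alone vs dentist-plus-AI arms; the discordant counts are invented, not from this paper or from Devlin et al.:

```python
import math

# Hypothetical paired design: the same dentists read the same surfaces
# unaided and AI-assisted. The discordant counts below are invented.
b = 12  # surfaces read correctly unaided but incorrectly with AI
c = 31  # surfaces read incorrectly unaided but correctly with AI

# McNemar's test with continuity correction: only the discordant pairs
# carry information about whether AI assistance changed reader accuracy.
chi2 = (abs(b - c) - 1) ** 2 / (b + c)
p = math.erfc(math.sqrt(chi2 / 2))  # chi-square (1 df) survival function

print(f"chi2={chi2:.2f}, p={p:.4f}")  # here c > b, so a small p favors the AI-assisted arm
```

Only the discordant pairs matter, which is why even a modest reader study would be informative.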

### Commercial sales framing versus paper findings

Pearl (Second Opinion® product marketing) and Overjet (caries-detection product page) market early-stage caries detection as a primary value proposition. The paper's findings show this is exactly where the systems underperform: enamel-only lesion detection ranged from 3/12 to 5/12 across the commercial systems. The contrast between vendor marketing and measured performance is worth calling out in the Discussion. Future researchers building on this benchmark will want a citable reference for the gap.

### Elevation of the proximal-surface finding

The OR = 0.30 (95% CI 0.20–0.45) result for sensitivity on proximal versus occlusal surfaces seems clinically relevant. Proximal surfaces are where secondary caries most commonly develop. The result is in Section 3 but doesn't make it to the Conclusion, where it deserves to be.

### Experimental models: author-developer independence

The experimental models (Exp_1, Exp_2, Exp_3) were developed by the Radboudumc Dental AI Hub, which is the same institutional group authoring the present benchmark. The paper establishes its independence from the commercial vendors; it's worth noting the parallel point on independence from the academic developers.

### Transparency, explainability, and the academic direction this paper opens

The paper's strongest contribution beyond the benchmark itself is its call for standardized benchmarking and open datasets in dental AI. Commercial CDSSs operate as proprietary, cloud-based platforms. They don't publish data sources, labeling protocols, or post-market performance indicators. That opacity is the gap independent benchmarks like this one address.

From industry experience: the labeling work for these vendors is done by clinical AI annotators, some on vendor payroll, some contracted, some volunteer. The data is out there. Opacity is a vendor product decision, not a technical limitation. Future work could explore whether vendors would share aggregate training-data statistics, like case mix, geographic distribution, and annotation protocol summaries, without giving up proprietary detail.

---

## Recommendation

Major revisions. The benchmark itself is methodologically strong. Three main asks:

- Push back on the "adjunct" framing. It assumes clinicians keep their full capacity, which a cross-sectional design can't verify.

- Move away from "conservative" as a description of what the data shows.

- Consider whether a shared public benchmark could extend the paper's open-data direction.

The minor issues cover scope (vendor pool, reader study), framing (marketing vs findings, the proximal-surface result, author-developer independence), and the transparency / labeling-labor gap.

---

## Reviewer vantage point and disclosures

- Reviewer: Paritosh Katyal, Director of Product at Bola AI.

- Bola AI: voice AI vendor for clinical documentation in dental practice.

- Patents: Co-inventor on US Patent US-20240257807 (voice-driven clinical data entry) and on a related patent extension for restorative charting. Both patents cover voice input and output; no radiographic AI is involved.

- Scope of expertise: voice and natural language interfaces, not radiographic image AI. Where this review draws on voice AI deployment patterns to comment on radiographic AI, I am offering it as analogy, not empirical evidence.

- Prior commercial relationships: Through Bola AI, I have prior commercial relationships with Pearl Inc. (Second Opinion®), one of the five vendors evaluated in this paper, and with Overjet, which is not evaluated but which I name in the vendor-pool minor issue as missing from the comparison. Bola previously had reseller and sales-partnership arrangements with both. Both vendors have since built their own voice AI offerings, which makes those arrangements moot, though we still talk at times. Bola has not pursued radiographic AI.

I have signed this review.

---

Signed,

Paritosh Katyal

Director of Product, Bola AI

Co-inventor, US Patent US-20240257807

ORCID: 0009-0008-1549-7706

Date: 05-13-2026

Competing interests

Through Bola AI, I have prior commercial relationships with Pearl Inc. (Second Opinion®), one of the five vendors evaluated in this paper, and with Overjet, which is not evaluated but which I name in the vendor-pool minor issue as missing from the comparison. Bola previously had reseller and sales-partnership arrangements with both. Both vendors have since built their own voice AI offerings, which makes those arrangements moot, though we still talk at times. Bola has not pursued radiographic AI.

Use of Artificial Intelligence (AI)

The author declares that they used generative AI to come up with new ideas for their review.
