PREreview of Self-Reported Side Effects of Semaglutide and Tirzepatide in Online Communities

by Rosewood Wasp

Published: March 23, 2026
DOI: 10.5281/zenodo.19188976
License: CC0 1.0

Peer Review: Self-Reported Side Effects of Semaglutide and Tirzepatide in Online Communities

Manuscript: Sehgal et al., arXiv:2603.12341
Review date: 2026-03-23

Summary

Sehgal et al. present a cross-sectional analysis of 410,198 Reddit posts (May 2019–June 2025) mentioning semaglutide or tirzepatide, identifying 67,008 users as self-reporting drug users via a GPT-4o-mini classifier. Among these, 43.5% report at least one side effect. The most frequently reported effects are nausea (36.9%), fatigue (27.8%), and hair loss (21.4%). The authors identify menstrual irregularities (3.8% of all users) and temperature-related symptoms as potentially unrecognized adverse effects and recommend these findings be considered for regulatory label updates. The central methodological claim is that large-scale social media analysis can "complement traditional pharmacovigilance by detecting emerging safety signals."

The study addresses a genuine gap: GLP-1 receptor agonists have a large and rapidly growing user base, and real-world adverse effect patterns in routine clinical use are incompletely characterized by clinical trials. The authors are correct that social media corpora contain pharmacovigilance-relevant information. The methodological question is whether this particular study design adequately supports the pharmacovigilance and regulatory claims made.

Strengths

Scale and temporal span. The corpus of 410,198 posts over six years is among the largest social media pharmacovigilance datasets for this drug class and captures the full adoption arc from early prescribing through the current obesity-indication expansion.
Systematic LLM pipeline with reported validation metrics. The GPT-4o-mini classification and MedDRA mapping approach is described with sufficient methodological transparency, and the authors report precision and recall metrics for the pipeline. This is a useful contribution to the methodological literature on LLM-assisted pharmacovigilance.
Focus on real-world experience. Social media captures adverse effect reporting from patients who are not enrolled in trials, who use drugs off-label, who have comorbidities typically excluded from trials, and who discontinue treatment — populations where pharmacovigilance data are sparse.
Novel signal hypotheses. The reproductive and temperature findings, regardless of their ultimate pharmacological attribution, constitute hypothesis-generating observations that could inform the design of prospective studies. This is the appropriate frame for this type of data.

Major Concerns

1. Attribution failure — weight loss confounding

The study's two novel findings — menstrual irregularities and temperature dysregulation — cannot be attributed to GLP-1 receptor agonist pharmacology using this dataset. Both are well-characterized sequelae of caloric restriction and weight loss independent of drug mechanism. Menstrual cycle normalization in women with obesity-related anovulation is documented across multiple weight-loss interventions; the bariatric surgery literature establishes exactly this pattern as a weight-loss-mediated effect rather than a procedure-specific pharmacological effect. Temperature dysregulation (including cold sensitivity) is a recognized consequence of significant caloric restriction.

The study design cannot distinguish among three explanations: a direct GLP-1R pharmacological effect, a weight-loss-mediated effect, and an effect of baseline conditions in the underlying population (T2D, obesity, PCOS). No control group, no weight-loss outcome data, and no stratification by magnitude of weight loss are present. Presenting these findings as "novel GLP-1 RA side effects" overstates what the methodology can establish. The appropriate framing is "symptoms co-reported with GLP-1 RA use by Reddit users," which is a meaningfully different claim.

2. Denominator error rendering all prevalences non-generalizable

Reported prevalences are conditioned on users who posted about a side effect; the denominator excludes the treated GLP-1 RA population who use the drug without posting to Reddit, who post about other aspects of their experience, or who experience no effects worth reporting. The 43.5% figure cannot be interpreted as the prevalence of side effects in GLP-1 RA users — it is the prevalence among a self-selected subset that actively discusses side effects in social media communities. This limitation is acknowledged in the text but then effectively ignored in the Discussion, where prevalence figures are treated as clinically informative.

This is not a limitation that larger samples resolve: adding more posts from the same posting-about-side-effects population does not recover the missing non-posting population. The denominator problem is structural.
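A minimal sketch of why sample size does not help, under assumed numbers: suppose a true side-effect prevalence and two posting probabilities, one for users with a side effect and one for users without. All three values and the function name `observed_prevalence` are hypothetical illustrations, not estimates from the manuscript. The treated-population size cancels out of the ratio, so the gap between true and observed prevalence is identical at every scale.

```python
def observed_prevalence(n_treated, true_prev, p_post_if_effect, p_post_if_none):
    """Prevalence of side effects among users who post, given selective posting."""
    posters_with_effect = n_treated * true_prev * p_post_if_effect
    posters_without_effect = n_treated * (1.0 - true_prev) * p_post_if_none
    return posters_with_effect / (posters_with_effect + posters_without_effect)


# Hypothetical inputs: 20% true prevalence; users with a side effect are
# assumed to be three times as likely to post as users without one.
TRUE_PREV = 0.20
P_POST_IF_EFFECT = 0.06
P_POST_IF_NONE = 0.02

for n_treated in (10_000, 1_000_000, 100_000_000):
    observed = observed_prevalence(
        n_treated, TRUE_PREV, P_POST_IF_EFFECT, P_POST_IF_NONE
    )
    print(f"treated={n_treated:>11,}  true={TRUE_PREV:.3f}  observed={observed:.3f}")
```

Under these assumed rates the observed figure is roughly double the true one at every population size. Only information about the non-posting population would change that.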

3. Numerator inflation from user identification design

Independent of the denominator issue, the user identification step oversamples users whose drug mentions co-occur with symptom reports. The classifier identifies users mentioning the drug in the context of reporting experiences; users who mention the drug without reporting experiences are systematically less likely to be captured. The 67,008 cohort is more symptomatic than the GLP-1 RA user population at large by design, inflating all prevalence estimates beyond what the denominator problem alone would produce. The reported 43.5% prevalence figure is subject to both biases simultaneously.

4. Multiplicity exposure for novel signal claims

With 40+ MedDRA symptom categories analyzed, the reproductive and temperature findings highlighted as novel are precisely the categories most vulnerable to false positive inflation from uncontrolled multiple comparisons. No Bonferroni correction, FDR adjustment, or pre-specified primary outcome is documented. The highlighted findings (3.8%, 5.8%) occupy exactly the rarer-finding range where false positive inflation is most consequential. The absence of a multiplicity correction for a hypothesis-generating study would not itself be disqualifying, but presenting specific low-frequency findings as supporting regulatory label review without acknowledging false positive risk substantially overstates the strength of the signal.
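To make the scale of the multiplicity exposure concrete, the sketch below computes the family-wise error rate for 40 tests (the lower bound implied by "40+"), the corresponding Bonferroni threshold, and a Benjamini-Hochberg selection. It assumes independent tests at a nominal alpha of 0.05, which the manuscript does not state; the p-values in the usage lines are invented for illustration only.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive) across independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold that keeps the family-wise rate <= alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value sits under the BH line.
        if p_values[idx] <= q * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


N_CATEGORIES = 40  # assumption: lower bound of the "40+" categories analyzed

print(f"family-wise error rate: {familywise_error_rate(N_CATEGORIES):.3f}")
print(f"Bonferroni threshold:   {bonferroni_threshold(N_CATEGORIES):.5f}")

# Invented p-values, purely to show the procedure.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print("BH rejections (indices):", benjamini_hochberg(example_p))
```

With 40 independent tests and no true effects, the chance of at least one nominally significant category is about 87%, which is why a highlighted low-frequency finding needs either a correction or an explicit caveat.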

5. LLM-to-MedDRA mapping noise unquantified for novel signal claims

The pipeline validation metrics are reported at the aggregate level. They do not quantify precision and recall for the specific MedDRA Preferred Terms and System Organ Classes driving the novel signal claims. The reproductive finding depends on the pipeline correctly mapping free-text menstrual cycle descriptions to the reproductive and breast disorders SOC rather than to the gastrointestinal SOC (where symptoms like nausea and cramping overlap). The temperature finding requires distinguishing drug-related temperature sensitivity from cold intolerance secondary to weight loss. The aggregate performance metrics do not establish that the pipeline is sufficiently precise for these specific cross-SOC classifications. The relevant validation is PT-level performance on these specific signal categories.
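The gap between aggregate and category-level performance can be illustrated with a small sketch. The annotated sample below is entirely hypothetical, the category labels are simplified stand-ins for MedDRA terms, and the function assumes one label per mention, which may not match the authors' pipeline.

```python
from collections import Counter


def per_category_metrics(gold, predicted):
    """Return {category: (precision, recall)} for single-label classifications."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted category gains a false positive
            fn[g] += 1  # the gold category gains a false negative
    metrics = {}
    for cat in set(gold) | set(predicted):
        predicted_n = tp[cat] + fp[cat]
        gold_n = tp[cat] + fn[cat]
        precision = tp[cat] / predicted_n if predicted_n else None
        recall = tp[cat] / gold_n if gold_n else None
        metrics[cat] = (precision, recall)
    return metrics


# Hypothetical validation sample of 100 annotated mentions.
gold = ["gi"] * 90 + ["reproductive"] * 4 + ["other"] * 6
predicted = (
    ["gi"] * 88 + ["reproductive"] * 2   # two GI cramping mentions misfiled
    + ["reproductive"] * 2 + ["gi"] * 2  # two menstrual mentions misfiled
    + ["other"] * 6
)

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
metrics = per_category_metrics(gold, predicted)

print(f"aggregate accuracy: {accuracy:.2f}")
for cat in sorted(metrics):
    print(cat, metrics[cat])
```

In this invented sample the aggregate accuracy is 96% while precision and recall for the rare reproductive category are both 50%, the pattern that aggregate reporting would hide.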

6. Regulatory recommendation without intermediate confirmation

The recommendation to consider regulatory label updates based on these findings inverts FDA's evidentiary standard for causal attribution. FDA requires confirmatory evidence — typically from spontaneous adverse event reports with dose, causality assessment, and temporal relationship, or from prospective surveillance studies — before labeling changes. The study explicitly frames its own methodology as hypothesis-generating, then recommends a regulatory action that presupposes causal confirmation. These two statements cannot both be correct. The appropriate recommendation is the design of prospective studies capable of providing the confirmatory evidence the study acknowledges is absent.

7. Subreddit availability bias

The subreddit environment primes users to recognize, discuss, and attribute experiences to the drug before and during reporting. Users joining GLP-1 RA subreddits have been exposed to community-established side effect narratives prior to reporting their own experiences; availability of specific symptom narratives increases the probability of those symptoms being noticed and reported. This creates a bias mechanism not present in distributed adverse event reporting systems, where reporters are not pre-primed by a shared community narrative. The study does not address whether the apparent signal rates reflect pharmacological experience rates or community narrative amplification rates.

8. Sex stratification absent for female-specific outcome

The menstrual irregularity finding (3.8% of all users) conflates a female-specific outcome with a mixed-sex denominator. The clinically interpretable figure is the rate among female users, which the study does not report. With a mixed-sex denominator that almost certainly includes a substantial proportion of male users, the 3.8% figure systematically understates the female-specific rate. Any clinical interpretation of this finding requires the sex-stratified denominator.
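The size of the understatement depends on the female share of the cohort, which the manuscript does not report. As a rough sketch, assuming every menstrual irregularity report comes from a female user, the female-specific rate is the overall rate divided by that share. The shares below are hypothetical placeholders, not estimates.

```python
def female_specific_rate(overall_rate, female_share):
    """Rescale a mixed-sex rate to a female-only denominator.

    Assumes all reports of the outcome originate from female users.
    """
    if not 0.0 < female_share <= 1.0:
        raise ValueError("female_share must be in (0, 1]")
    return overall_rate / female_share


OVERALL_RATE = 0.038  # 3.8% of all users, as reported in the manuscript

# Hypothetical female shares; the true value is unknown.
for share in (0.5, 0.7, 0.9):
    rate = female_specific_rate(OVERALL_RATE, share)
    print(f"female share {share:.0%}: female-specific rate {rate:.1%}")
```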

Minor Comments

The comparison of reported prevalences against clinical trial adverse event rates (implicit in the Discussion) is methodologically invalid given the denominator mismatch described in Major Concern 2. This comparison should either be removed or accompanied by an explicit statement that the two figures measure different populations and are not directly comparable.
The temporal trend analysis conflates increasing GLP-1 RA adoption with increasing community discussion volume. Normalizing post volume against estimated prescription data (available from drug utilization databases) would strengthen any claims about temporal changes in reporting rates.
The study does not distinguish semaglutide from tirzepatide in its primary analyses, despite meaningfully different pharmacological profiles (dual GIP/GLP-1 agonism in tirzepatide versus GLP-1 agonism alone in semaglutide). Drug-specific stratification in the primary analysis is warranted.
A sensitivity analysis excluding posts containing third-person references ("she," "he," "my wife," "my husband") would help bound the false positive rate for the self-report assumption. Some proportion of the 67,008 users may be caregivers or family members reporting others' experiences.
The study does not characterize the proportion of posts from the dosing/titration period versus the maintenance period. Side effect profiles during titration (particularly GI effects) are known to differ substantially from maintenance-phase profiles; combining these inflates apparent prevalence of titration-period effects.
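The normalization suggested in the comment on temporal trends could look like the following sketch, which expresses symptom reports per 1,000 prescriptions in each period. The yearly counts are invented for illustration, and `reports_per_thousand` is a hypothetical helper, not part of the authors' analysis.

```python
def reports_per_thousand(reports, prescriptions):
    """Symptom reports per 1,000 prescriptions for each period in both inputs."""
    rates = {}
    for period in sorted(reports):
        if period not in prescriptions or prescriptions[period] <= 0:
            continue  # skip periods without usable utilization data
        rates[period] = reports[period] * 1000 / prescriptions[period]
    return rates


# Invented yearly counts: reports grow 16-fold, but so do prescriptions.
reports = {2021: 200, 2022: 800, 2023: 3200}
prescriptions = {2021: 100_000, 2022: 400_000, 2023: 1_600_000}

for period, rate in reports_per_thousand(reports, prescriptions).items():
    print(f"{period}: raw={reports[period]:>5}  per 1,000 prescriptions={rate:.2f}")
```

In this invented series the raw count rises steeply while the normalized rate is flat, which is the distinction a temporal claim would need to establish.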

Questions for Authors

What proportion of the 67,008 identified users are female? Reporting the menstrual irregularity prevalence with a female-restricted denominator is necessary for clinical interpretation. The mixed-sex denominator renders the 3.8% figure uninterpretable for practice.
Can you provide PT-level precision and recall for the specific MedDRA categories driving the novel signal claims — menstrual irregularities and temperature-related symptoms — rather than aggregate pipeline metrics? What is the false positive rate for each of these specific categories?
What is the study's proposed mechanism distinguishing direct GLP-1R pharmacological effects on reproductive function from the well-characterized weight-loss-mediated menstrual normalization and new-onset irregularity documented in the bariatric surgery literature? What study design features would be required to separate these pathways in future research?
Have you considered a first-report isolation analysis — restricting to each user's first post mentioning a symptom — to assess whether apparent clustering of temperature and reproductive symptoms survives the removal of posts that may represent social confirmation rather than independent reporting?
How does the study's finding that menstrual irregularities occur at 3.8% overall relate to the known background prevalence of menstrual irregularities in the obesity and T2D populations from which GLP-1 RA users are predominantly drawn? Without a baseline prevalence for the underlying population, it is unclear whether 3.8% represents an elevation above background or falls within expected rates.
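One possible reading of the first-report isolation analysis raised in the questions above is sketched here: keep only the earliest post for each user and symptom pair, then compare symptom counts before and after. The tuple layout, symptom labels, and dates are hypothetical, and the authors may prefer a different unit, such as each user's first post about any symptom.

```python
from collections import Counter
from datetime import date


def first_reports(posts):
    """Map each (user, symptom) pair to the date of its earliest post."""
    earliest = {}
    for user, posted_on, symptom in posts:
        key = (user, symptom)
        if key not in earliest or posted_on < earliest[key]:
            earliest[key] = posted_on
    return earliest


def symptom_counts(earliest):
    """Count distinct users per symptom after first-report isolation."""
    return Counter(symptom for (_user, symptom) in earliest)


# Hypothetical posts: (user id, post date, mapped symptom).
posts = [
    ("u1", date(2024, 1, 5), "cold intolerance"),
    ("u1", date(2024, 2, 1), "cold intolerance"),
    ("u2", date(2024, 1, 20), "cold intolerance"),
    ("u2", date(2024, 3, 2), "menstrual irregularity"),
    ("u3", date(2024, 1, 7), "nausea"),
    ("u3", date(2024, 1, 9), "nausea"),
    ("u3", date(2024, 1, 3), "nausea"),
]

raw = Counter(symptom for _user, _posted_on, symptom in posts)
isolated = symptom_counts(first_reports(posts))

for symptom in sorted(raw):
    print(f"{symptom}: raw posts={raw[symptom]}, first reports={isolated[symptom]}")
```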

Competing interests

The author declares that they have no competing interests.

Use of Artificial Intelligence (AI)

The author declares that they used generative AI to come up with new ideas for their review.
