PREreview of Data-Centric AI for EEG-Based Emotion Recognition: Noise Filtering and Augmentation Strategies

DOI: 10.5281/zenodo.17160978
License: CC BY 4.0

This paper proposes a data-centric AI framework for EEG-based emotion recognition that prioritizes improving data quality (through noise filtering) over developing more complex models. Instead of introducing a new deep learning architecture, the authors use a relatively simple 1D convolutional neural network and focus on two key strategies:

Participant-Guided Noise Filtering: EEG samples are removed if the participant’s self-reported emotion intensity for that trial falls below a threshold. By excluding these low-intensity (ambiguous or noisy) emotion samples, the training data becomes more reliable. The authors demonstrate that progressively filtering out such ambiguous samples leads to notable performance gains. In binary arousal classification (high vs. low arousal), cleaning the dataset by removing all samples with self-rating ≤0.95 yielded an 8% absolute accuracy increase (from ~74.2% to 82.21%) and a 5.5% F1-score increase. This underscores the benefit of prioritizing “good data” over big data in model training.
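
The mechanics of this cleaning step appear simple enough to reproduce; a minimal sketch of my reading of it (assuming self-ratings are normalized to [0, 1] and features are the 310-dimensional trial vectors described in the paper) might look like:

```python
import numpy as np

def filter_by_self_rating(X, y, ratings, threshold):
    """Drop trials whose self-reported emotion intensity is at or below the threshold.

    X:       (n_trials, 310) differential-entropy feature vectors
    y:       (n_trials,) emotion labels
    ratings: (n_trials,) self-reported intensities, assumed scaled to [0, 1]
    """
    keep = ratings > threshold
    return X[keep], y[keep]

# e.g. an aggressive setting: drop every trial rated at or below 0.9
# X_clean, y_clean = filter_by_self_rating(X, y, ratings, threshold=0.9)
```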

Systematic Data Augmentation: To compensate for the reduced data after filtering (and to address class imbalance), the authors generate synthetic EEG samples. Three simple augmentation methods are explored: (1) averaging two consecutive data entries per emotion (Augmentation 1), (2) averaging 3–5 consecutive entries per emotion (Augmentation 2), and (3) injecting Gaussian noise into original data (Gaussian). These augmented samples are added to the training set (while the test set is only cleaned, not augmented, to avoid leakage). Augmentation had a modest effect in the high-level binary task (where data was already plentiful), but became increasingly important in finer-grained tasks with fewer samples per class. Notably, in the 7-class emotion classification (six basic emotions + neutral), the Augmentation 2 strategy improved accuracy by up to ~5% (from 33.8% to 38.4% at certain cleaning levels) compared to using only the original data. Augmentation particularly boosted the F1-score for under-represented classes, helping mitigate class imbalance.
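
For concreteness, here is how I understand the three augmentations; note that my interpretation of “consecutive” (adjacent rows within the same emotion class) and the noise scale are assumptions on my part, not details confirmed by the paper:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def average_consecutive(X_class, k):
    """Augmentation 1 (k=2) / Augmentation 2 (k in 3..5): average k consecutive
    feature vectors of one emotion class. 'Consecutive' = adjacent rows after
    grouping by label, which is my reading, not a confirmed detail."""
    return np.stack([X_class[i:i + k].mean(axis=0)
                     for i in range(len(X_class) - k + 1)])

def gaussian_augment(X_class, sigma=0.05):
    """Gaussian strategy: jitter original samples with zero-mean noise.
    sigma=0.05 is an illustrative value; the paper's setting is not stated here."""
    return X_class + rng.normal(0.0, sigma, size=X_class.shape)

# Augmented samples would be appended to the training split only; the test
# split stays cleaned-but-unaugmented, as the authors describe, to avoid leakage.
```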

In short, this work advances the field of affective computing by demonstrating a data-centric approach in EEG emotion recognition. It is one of the first studies to explicitly address label noise cleaning in an EEG emotional dataset and show its quantitative impact on model accuracy. The combination of noise filtering with augmentation is shown to be a practical and reproducible pathway for performance improvement. The findings encourage researchers to invest effort in dataset quality (consistency of labels, removal of ambiguous trials) and simple data expansions, as these can yield robust gains across multiple emotion classification granularity levels. This represents a shift in emphasis from developing ever more complex EEG models to making better use of the data we have, which is an important perspective for resource-constrained biomedical AI applications.

Major Issues and Suggestions

1. Validation on Unseen Subjects (Generalizability)

It is unclear whether the reported evaluation ensures subject-independent testing. The paper mentions using a shuffled K-fold cross-validation (training on K-1 folds, testing on the held-out fold), but does not specify if folds were arranged to keep each subject’s data separate (leave-one-subject-out). If the K-fold was purely random, data from the same participant could appear in both training and test sets, potentially inflating performance by allowing the model to learn subject-specific patterns. This is a critical point for EEG-based emotion recognition – models often struggle to generalize to new individuals due to idiosyncratic EEG features. The authors should clarify the evaluation protocol. For stronger evidence of real-world utility, I recommend performing a subject-independent evaluation (such as leave-one-subject-out or train on 19 subjects, test on the held-out subject).
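
With scikit-learn, such a subject-independent protocol is a small change to the evaluation loop; a sketch, assuming X, y, and per-trial subject_ids are already loaded, and using a simple stand-in classifier in place of the authors’ 1D CNN:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import LogisticRegression

# subject_ids: (n_trials,) array mapping each trial to its participant
logo = LeaveOneGroupOut()
scores = []
for train_idx, test_idx in logo.split(X, y, groups=subject_ids):
    clf = LogisticRegression(max_iter=1000)  # stand-in for the 1D CNN
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print(f"subject-independent accuracy: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```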

2. Augmentation Methodology - Clarity and Validity

The data augmentation strategies, while straightforward, raise some concerns about how realistic and generalizable the synthetic data are. The paper describes Augmentation 1 as averaging “two consecutive data entries for each emotion” and Augmentation 2 as averaging 3, 4, or 5 consecutive entries. It is not fully clear how these “consecutive” entries are chosen or why averaging consecutive samples is expected to produce meaningful new examples. Does “consecutive” refer to adjacent samples in the dataset ordering (and if so, is the dataset sorted by time, by subject, or by intensity)? If samples from different subjects are averaged together, the resultant feature vector may mix subject-specific EEG characteristics in a way that does not correspond to any realistic EEG pattern. This could inadvertently introduce noise or distort class distributions. The authors should clarify this procedure in the manuscript for reproducibility. Moreover, justification or validation of these augmented samples is needed.

3. Data Cleaning Threshold - Practical Impact and Bias

The idea of discarding low-rated trials raises a potential concern: what is the real-world implication of removing up to 90% of the data? In the experiments, a very aggressive threshold (0.9) leaves only the top 200 out of 1600 samples. While this yielded the best accuracy on the remaining test data, such a high threshold might not be practical in deployment. In real applications, an EEG system will encounter many subtle or low-intensity emotions – it cannot simply ignore them. Removing “noisy” samples is essentially trading off recall for precision; the model excels on clearly labeled cases but is never tested on the ambiguous ones it threw away. The manuscript should discuss this trade-off. What threshold would the authors ultimately recommend for a deployed system? Too low, and noisy labels hurt performance; too high, and the model is only trained on extreme cases and may fail on moderate cases. An alternative approach could be label smoothing or weighting instead of hard removal.
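
As a concrete version of that alternative, trials could be down-weighted by their self-rating rather than dropped; a hypothetical sketch (the floor value is my own illustrative choice, not anything from the paper):

```python
import numpy as np

def rating_weights(ratings, floor=0.1):
    """Soft alternative to hard removal: weight each training trial by its
    self-reported intensity so ambiguous trials contribute less to the loss
    without disappearing from the data. The floor keeps every trial minimally
    represented; 0.1 is illustrative, not tuned."""
    return np.clip(ratings, floor, 1.0)

# Keras-style usage: model.fit(X, y, sample_weight=rating_weights(ratings))
```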

4. Comparison with Prior Work (Context of Results)

The authors claim that their data-centric approach “in some cases [surpasses] the results of more sophisticated published models”. This is an exciting claim, but the support for it could be made more explicit. The only direct comparison provided is with the MAET model from prior work [26], shown in Table 4 for the seven-emotion classification. It appears that when moderate noise filtering is applied (removal threshold ≤0.8), the authors’ simpler model with augmentation achieved higher accuracy/F1 than MAET on the same task. This is a strong outcome and should be highlighted more clearly in the text. I suggest the authors explicitly state, “Our approach outperforms the MAET transformer (state-of-the-art model from [26]) on the 7-class emotion task when using data cleaning up to a certain level.” Moreover, beyond MAET, it would strengthen the paper to cite numerical results from other relevant studies (if available) to contextualize the performance. How do the achieved accuracies for binary or four-quadrant classification compare to any earlier results on SEED or SEED-VII? The introduction mentions several sophisticated models (LSTMs, GNNs, etc. in references [19][20][28]) that tackled emotion EEG. If those studies reported comparable metrics, including a brief comparison would show readers that this data-centric method truly keeps up with or exceeds prior art.

5. Statistical Significance and Robustness

The improvements reported (+8% accuracy) are quite promising, but the paper does not report any statistical significance testing or confidence intervals. Since a K-fold cross-validation was used, the authors likely have multiple runs (fold results) from which variability could be assessed. Reporting the mean ± standard deviation of accuracy/F1 across folds, or performing a paired t-test between, say, the baseline and the cleaned/augmented conditions, would substantiate that these gains are consistent and not due to chance. For example, if the baseline accuracy was 74.2% and after cleaning it became 82.2%, was this improvement observed on most folds, or is the average skewed by a few? Providing some measure of variance (even in a footnote or small graph) would increase confidence in the robustness of the approach.
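
Since the folds are paired across conditions, such a test is one line of scipy; a sketch with hypothetical per-fold values (the real numbers would come from the authors’ runs):

```python
import numpy as np
from scipy import stats

# Per-fold accuracies under both conditions, same folds (values hypothetical).
baseline = np.array([0.73, 0.75, 0.74, 0.76, 0.73])
cleaned  = np.array([0.81, 0.83, 0.82, 0.84, 0.81])

t, p = stats.ttest_rel(cleaned, baseline)  # paired t-test across folds
print(f"mean gain = {(cleaned - baseline).mean():.3f}, p = {p:.4f}")
```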

6. EEG Domain Considerations (Normalization and Cross-Subject Variability)

A domain-specific concern is how the EEG features were handled across different participants. The paper describes converting EEG signals into differential entropy features across 5 frequency bands and 62 channels, then averaging over each video clip’s duration (yielding a 310-dimensional feature per trial). However, it is not mentioned whether any form of normalization or baseline correction was applied per subject. In EEG data, due to individual differences (baseline brain activity levels, electrode impedance variances), it is common to perform some normalization (such as z-scoring features per subject or subtracting per-channel means) to reduce person-specific bias. If no normalization was done, the model might partially be distinguishing subjects rather than emotions. This ties back to the subject-independence issue above. I recommend the authors specify if they normalized the feature vectors (did they standardize each feature across all training samples, or normalize each subject’s data to zero mean/unit variance?). If not, it would be worthwhile to discuss the potential impact: the model might achieve higher accuracy by exploiting constant offsets in some channels that differ between subjects.
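
If normalization was indeed omitted, a per-subject z-scoring pass is cheap to add; a minimal sketch of what I have in mind:

```python
import numpy as np

def zscore_per_subject(X, subject_ids, eps=1e-8):
    """Standardize each of the 310 features to zero mean / unit variance
    within each subject, removing constant per-subject offsets that a
    classifier could otherwise exploit."""
    Xn = np.empty_like(X, dtype=float)
    for s in np.unique(subject_ids):
        mask = subject_ids == s
        mu = X[mask].mean(axis=0)
        sd = X[mask].std(axis=0) + eps
        Xn[mask] = (X[mask] - mu) / sd
    return Xn
```

One caveat on the design: in a subject-independent protocol this can safely be applied per subject, but in a random-split protocol the statistics should be computed from training trials only, to avoid leaking test information.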

Minor Issues and Suggestions for Improvement

Writing and Grammar: The manuscript is generally well-written, but a few minor errors should be corrected for clarity. “with a increase of 14.44%” should be “with an increase of 14.44%”. Similarly, in the sentence bridging Sections 4 and 5, a punctuation mark is missing: “Section 4 Lastly, Section 5 concludes the paper…” would read better as “Section 4. Lastly, Section 5 concludes the paper…”. A careful proofread to fix such small grammar issues (articles, comma placements, etc.) will improve readability. Also consider consistency in tense (most of the paper is in past tense when referring to the conducted study; ensure this is uniform).

Terminology consistency: The paper introduces the binary arousal task as “two-halves emotions.” While descriptive, this phrasing is a bit unconventional. In the abstract it’s rightly referred to as binary (high vs. low arousal). It may be clearer to stick to a consistent term like “binary high-vs-low arousal classification” throughout, instead of “two-halves”.

Augmentation description clarity: As noted in the major comments, the description of Augmentation 1 and 2 could be expanded slightly, e.g., by stating explicitly how the “consecutive” entries are selected and whether they are restricted to the same subject (see Major Issue 2).

Include the value of K (folds) and dataset splits: The paper does not explicitly state how many folds were used in cross-validation (common choices are 5 or 10). Additionally, clarifying the total number of trials (which we infer as 1600 from context) and how they were divided would be useful for completeness.

Future work expansion: The conclusion mentions extending to multimodal signals and real-world clinical studies as future work. Consider also explicitly stating subject generalization as a future direction (if the major issue above is taken on board). Even if not, acknowledging that deploying such a system would require ensuring it works on new individuals, and possibly exploring online or self-supervised label refinement, would make for interesting forward-looking statements.

Competing interests

The author declares that they have no competing interests.
