Simon Columbus1,* & William H. B. McAuliffe2
1Vrije Universiteit Amsterdam and 2Cambridge Health Alliance
* simon@simoncolumbus.co
In a recent pre-print, Sznycer et al. (2020) propose a new measure of perceived fitness interdependence, the Perceived Fitness Interdependence (PFI) scale. Fitness Interdependence has recently been defined as “the degree to which two or more organisms influence each other’s success in replicating their genes” (Aktipis et al., 2018; see also Brown, 1999; Roberts, 2005). Perceived Fitness Interdependence consequently refers to subjective perceptions of this degree of mutual influence. Before Sznycer et al. (2020) was posted as a pre-print, the manuscript was submitted to a peer-reviewed journal, for which we acted as referees. In our reviews, we raised several conceptual and methodological issues that we believe may have adversely affected the development and validation of the PFI scale in its current form. Because the pre-print is not substantively different from the submission we reviewed, we believe that our concerns still merit consideration.
We briefly summarise our conceptual and methodological concerns; the full reviews follow below.
Review 1 (McAuliffe)
This paper reports the development of the Perceived Fitness Interdependence scale. I would support publication of a revised version of this paper in [redacted]. There are potential analytic issues that the authors should address before the manuscript is accepted.
Analysis Issues
Missing data. What is the justification for dropping 30 cases with missing data, and for dropping the niece/nephew category entirely? 26% missing data is far from disqualifying. Why not use multiple imputation or FIML? Was the reason that 30 participants did not complete the measure for all targets that they did not have a sibling? That would be a clear case of selection bias.
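For concreteness, switching to full-information maximum likelihood is a one-argument change in lavaan. This is only a sketch: the item names (sf1-sf3 for shared fate, ee1-ee3 for empathetic engagement) and the data frame 'dat' are placeholders, not the authors' actual variable names.

```r
library(lavaan)

# Sketch: estimate the two-factor model under FIML instead of listwise deletion.
# Item names and 'dat' are placeholders for the authors' variables.
cfa_model <- '
  shared_fate =~ sf1 + sf2 + sf3
  empathetic  =~ ee1 + ee2 + ee3
'
fit_fiml <- cfa(cfa_model, data = dat, missing = "fiml")
summary(fit_fiml, fit.measures = TRUE)
```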
Multi-level structure. How was the multi-level structure of the data represented in the EFA and CFA? Or were responses aggregated across targets for each person? If the latter, this will not do because then the measurement model represents individual differences in the tendency to view relationships as interdependent. I take the primary point of the scale to be measuring perceptions that participants have about specific relationships, not trait-level differences in interdependence beliefs. I see two options here. First, the authors could re-run the factor analyses in a multi-level framework. The main benefit of this approach is that the authors can explicitly test whether the factor structure is the same or different across levels of analysis. If the authors continue using R, they could conduct this analysis in the xxm package (https://xxm.times.uh.edu/blog/) or in lavaan: http://faculty.missouri.edu/huangf/data/mcfa/MCFAinRHUANG.pdf. In the xxm package I think the measurement model and the regression model could be combined and reported all at once.
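To illustrate the first option: lavaan (version 0.6 and later) fits two-level CFAs directly, with target-level (within) and person-level (between) factors. Again a sketch with placeholder variable names.

```r
library(lavaan)

# Sketch: two-level CFA with target-level (within) and person-level (between)
# factors. 'participant_id' and the item names are placeholders.
ml_model <- '
  level: 1
    sf_w =~ sf1 + sf2 + sf3
    ee_w =~ ee1 + ee2 + ee3
  level: 2
    sf_b =~ sf1 + sf2 + sf3
    ee_b =~ ee1 + ee2 + ee3
'
fit_ml <- cfa(ml_model, data = dat, cluster = "participant_id")
summary(fit_ml, fit.measures = TRUE, standardized = TRUE)
```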
Second, the authors could conduct the same single-level CFA, one for each target. If done in a multi-group framework, the authors can explicitly test whether the factor structure is invariant across targets, which is an assumption of their contention that the measure can be adapted for any type of target. Until this assumption is explicitly tested, they should refrain from claiming that the measure can be adapted for any relationship category.
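To illustrate the second option, a multi-group sequence in lavaan would test configural, metric, and scalar invariance across targets; 'target' and the item names below are placeholders.

```r
library(lavaan)

# Sketch: measurement invariance across relationship targets.
cfa_model <- '
  shared_fate =~ sf1 + sf2 + sf3
  empathetic  =~ ee1 + ee2 + ee3
'
fit_configural <- cfa(cfa_model, data = dat, group = "target")
fit_metric     <- cfa(cfa_model, data = dat, group = "target",
                      group.equal = "loadings")
fit_scalar     <- cfa(cfa_model, data = dat, group = "target",
                      group.equal = c("loadings", "intercepts"))
anova(fit_configural, fit_metric, fit_scalar)
```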
Reporting of multilevel models. The reporting of results for the multilevel model should follow standard practice. A good example to follow would be Table 2 in Boudreaux and Ozer (2013), as well as the description of the model on page 438. Currently, the results are lacking many basic pieces of information: What was the ICC for the person level of the model? This is important because person-level variance could represent response biases (or perhaps a substantive difference in perceived fitness interdependence across myriad targets). Also, what centering strategies were used? The reader needs this information to interpret the coefficients.
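For instance, the person-level ICC can be read off an intercept-only model; 'pfi' and 'participant_id' are placeholder column names.

```r
library(lme4)

# Sketch: person-level ICC from an intercept-only multilevel model.
m0 <- lmer(pfi ~ 1 + (1 | participant_id), data = dat)
vc <- as.data.frame(VarCorr(m0))
icc_person <- vc$vcov[vc$grp == "participant_id"] / sum(vc$vcov)
icc_person
```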
Averaging across factors. Why did the authors average across factors in the multilevel regression? The factor analysis revealed that there are multiple factors. Similarly, Cronbach's alpha was reported for the overall fitness interdependence scale even though internal consistency coefficients require unidimensionality. The alphas should be computed separately for each sub-scale. Better yet, report McDonald's omega instead.
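Omega (alongside alpha) can be computed per factor from a fitted CFA, for example via semTools; this is a sketch using the same placeholder item names as above.

```r
library(lavaan)
library(semTools)

# Sketch: alpha and omega per factor from a two-factor CFA.
cfa_model <- '
  shared_fate =~ sf1 + sf2 + sf3
  empathetic  =~ ee1 + ee2 + ee3
'
fit <- cfa(cfa_model, data = dat)
reliability(fit)  # reports alpha, omega, and average variance extracted per factor
```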
Open Science. I was not able to access the data or syntax that the authors used to report their analyses. This would have helped me diagnose some of the issues that I am currently just speculating about. MTurk data is not usually considered sensitive, so long as the worker IDs are removed. Can this information be made publicly available?
Power analysis. How was the effect size for the power analysis chosen? How was the power analysis conducted? For which model was it conducted? Note that multi-level power analyses require taking the level-2 part of the model into consideration when estimating power at level 1. How much power did the SEM have to detect misfit? I do think the current sample sizes meet a minimum for the fairly simple models that the authors estimated (Wolf, Harrington, Clark, & Miller, 2013), but the fidelity of structural equation models will always improve with increased sample sizes because their quality depends on the accuracy of the entire variance-covariance matrix, and covariances only become very accurate as N grows very large.
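Power to detect misfit can be approximated with the RMSEA-based approach implemented in semTools; the degrees of freedom and N below are illustrative values, not those from the paper.

```r
library(semTools)

# Sketch: power to reject close fit (RMSEA = .05) when the true RMSEA is .08
# (MacCallum-style power analysis). df and n are illustrative placeholders.
findRMSEApower(rmsea0 = 0.05, rmseaA = 0.08, df = 8, n = 300, alpha = 0.05)
```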
Relation to Brown (1999). Looking at Table 2, the validity coefficients for the Brown measure and the authors' shared fate measure look rather similar. We need confidence intervals here to know if the authors' measure is actually doing any better, given that the Brown measure is arguably also measuring perceptions of shared fate. I find the multi-level model in the supplemental materials ambiguous in speaking to this question because crucial overlapping variance is being partialed out from every predictor. More broadly, I was surprised that the paper did not include a more pointed critique of the Brown measure, given that its existence could make this new measure (at least the shared fate sub-scale) redundant. I think the authors could add an argument for why it is important to excise the mutualism aspect of interdependence from a measure of perceived interdependence. I take it that the authors believe (and I agree) that how much you think fitness-relevant outcomes depend on a target is an important consideration independent of the extent to which the target needs you to succeed.
Pilot data. On page 5 the authors note, "Before collecting data for study 1 and study 2, we piloted the PFI scale items on MTurk two separate times using the same criteria outlined below." Where is this pilot data? Is it the data reported in the supplemental materials about the "negative interdependence" version of the scale? If not, then whatever happened with these pilot studies should be reported, if only in the supplemental materials. If yes, then the authors should more explicitly state in the main manuscript that they tried to develop a different version of the scale. "Failed" attempts to develop a scale are very important findings in themselves because they turn readers onto what won't work and why. This information shouldn't be buried.
Interpretation Issues
Even after basic analytic issues are resolved, several missing pieces would need to be filled in before declaring that the fitness interdependence measure is "ready for prime-time": attention to model misfit, discriminant validity, temporal stability, and content validity. Some of these can be addressed with the existing data; others would have to wait for future studies with richer datasets. I don't think the authors are yet in a position to recommend the measure for use. Rather, this paper should be framed more as a "progress report" on a validation process that has only just started.
Model fit. The confirmatory model did not fit. The high CFI/TLI isn't saying much other than that models with strongly intercorrelated items are very different from a baseline model in which items aren't allowed to correlate. The SRMR indicates that on average the residual correlations are "only" .036, but contributing to that average could be residual correlations larger than .10 or even .20. Inspecting modification indices or the standardized residual matrix would shed light on where the non-ignorable residual associations lie. These should be reported in the main text or supplement.
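In lavaan, both diagnostics are one line each on the fitted model object ('fit' below stands in for the authors' fitted CFA).

```r
library(lavaan)

# Sketch: locate local misfit in a fitted lavaan model ('fit' is a placeholder).
modificationindices(fit, sort. = TRUE, minimum.value = 10)  # candidate added parameters
residuals(fit, type = "cor")                                # residual correlation matrix
```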
Chi-square and RMSEA are both indicating substantial misfit. My guess is that global fit was non-significant in study 1 because cross-loadings are allowed in EFA, but was significant in study 2 because the confirmatory model did not allow for cross-loadings. So, the authors could use the results from the EFA as guidance for adding a cross-loading or two to the confirmatory model. But adding factors or residual correlations could be more reasonable than adding cross-loadings, depending on the authors' theoretical understanding of why the items intercorrelate. The authors could also consider using Bollen's (2019) MIIV method (see Fisher et al. (2017) for an R package) to assess the misfit of each loading to pin down where the problem is. Yet another option is to add predictors or outcomes to the model to see if all of the causal action is happening via the factors alone. If modification indices indicate that adding paths from predictors to item residuals or from item residuals to outcomes is necessary to make the model fit, then more factors might be needed to account for the totality of the items' causal powers (see Hayduk, 2016 for an example).
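The MIIVsem package accepts lavaan-style syntax and reports an equation-level Sargan test that helps localise misspecification; again, the item names are placeholders.

```r
library(MIIVsem)

# Sketch: MIIV-2SLS estimation; Sargan tests flag equations whose instruments
# are inconsistent with the model (i.e., problematic loadings).
miiv_model <- '
  shared_fate =~ sf1 + sf2 + sf3
  empathetic  =~ ee1 + ee2 + ee3
'
miive(model = miiv_model, data = dat)
```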
To be clear, I am not recommending that the authors pursue non-significant global fit at all costs; adding paths will always improve fit even when the added paths are not well-motivated. For example, that 5 of 6 items are positively worded raises the possibility that the model fit indices are sensitive to the fact that the positively worded items share a source of variance that is not shared with the negatively worded item. It's possible that fit would improve by deleting the negatively worded item because then there would be no way for the model to disentangle shared variance due to the construct of interest versus shared variance due to the valence of the items. But the increase in fit would decrease validity because now the item valence variance is part of the latent factor, erroneously being counted as trait variance. All I am suggesting is that explicitly reporting the model as failing and considering alternative models would make transparent to the reader some of the possibilities for improving the measure. This is important because mis-specified latent variables will have biased associations with other variables and we don't know a priori whether this bias is small or large. There is also a broader pedagogical point to make because model fit standards among applied researchers are lax to the point where model testing has become a merely performative ritual rather than a genuine attempt at falsification. And the use of approximate fit cut-offs perpetuates the common but untrue notion that approximate fit indices track the degree of model misspecification. See Ropovik (2015) and Greiff and Heene (2017) for accessible reviews of common misconceptions about model fit.
Discriminant Validity. There were no efforts to establish discriminant validity. So, we know the PFI measures *something* that predicts willingness to help, but until we see small/nil correlations with other criteria that shared fate/empathetic engagement *shouldn't* correlate with (but, crucially, that other constructs that also predict willingness to help *should* correlate with), we cannot yet say that the measures are tapping interdependence in particular rather than one of the many constructs that predict willingness to help.
Temporal Stability. Does the reliable variance mostly reflect what the authors want to measure, i.e., a stable belief about a target that is only updated when there is a stable change in the interdependence of the relationship (plus perhaps some lag time to register the change in circumstances)? Or does it reflect transient fluctuations in that belief that are not useful for understanding how the participant thinks about the other person in general, e.g., their feelings about the target based on social interactions with the target that day, projections of their mood at the time of measurement, etc.? Longitudinal data, like discriminant validity, would be key in follow-up studies. Presumably you would want a fairly short period between the test and re-test because the interdependence of some relationships can change fairly quickly. But you also don't want it to be so short that participants remember precisely what they said the first time around or are still being influenced by the same transient source of variance that was affecting them at the first measurement. It would be nice for the authors to theorize about what the ideal time gap would be for a dependability study, or to simply note that this is an open question that needs to be determined empirically.
Content Validity. The extremely high factor loadings can be attributed to the extremely similar wording of the items. This redundancy guarantees a high reliability coefficient but defeats the purpose of having multiple items. Multiple items can be justified on a few different grounds. First, they may be worded sufficiently differently that individual differences in people's idiosyncratic understanding of what the items are asking are represented as unique variance in an SEM. Second, the items may have different endorsement rates, which helps ensure that the entire range of the construct is measured with adequate precision. Third, each item might not tap the central "essence" of the construct, and so you hope that each of your items is inaccurate in a different direction, such that the inaccuracies cancel out in the aggregate. Highly similar items do not serve any of these goals. All they do is lengthen the protocol and give the misleading impression of very high reliability when in fact reliability is low, because the measurement error specific to a certain way of measuring the construct has not been partialed out from the latent factor. (Note that the empathetic engagement factor does a little better on this score because it has a reverse-scored item, which can cancel out some bias due to the valence of the item. The shared fate factor would probably increase in validity if two of the positively worded items were replaced with one negatively worded item.)
The authors should consider seeing what happens when they represent each factor using just the single most face-valid indicator. My prediction is that relationships with other variables will be basically unchanged, with the possibility of ever-so-slightly larger confidence intervals. If this is the case, then there is no need to promote using a multi-item measure just because that's typically how questionnaires are done. Alternatively, revisions of the measure could involve more differentiated items, if the authors are worried that any one item is not sufficiently precise to get at the heart of the construct. Factors with more diverse item sets will have stronger correlations with criterion variables (the high reliability but low predictive validity of composites with similar items is called "the attenuation paradox").
Mean composite. The authors used a latent variable model to represent the measure, but then abandoned it for a (mean?) composite score in the regression analysis. This may be justifiable in this particular case because the items are so redundant that a mean composite score actually does come close to replicating what you would get with the latent variable model. Of course, that could change once the authors address model misfit. But, generally speaking, it makes sense to consider using a *weighted* mean/sum composite if one is going to go through the trouble of figuring out the "true" representation of the model using SEM, rather than first acknowledging the "truth" but then in practice biasing the regression coefficients by representing the predictors in a sub-optimal way.
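A quick check is to extract factor scores from the fitted CFA and compare them with the unweighted means; lavPredict() does this for a lavaan model ('fit' again stands in for the fitted CFA).

```r
library(lavaan)

# Sketch: factor scores as weighted composites from the fitted CFA ('fit' is a
# placeholder); these could replace the unweighted means in the regressions.
factor_scores <- lavPredict(fit)
head(factor_scores)
```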
Dependent Variables. I didn't see any dimensionality assessments of the willingness-to-help variables. Is it justifiable to treat them as unidimensional and score them as mean composites? If not, it might be better to treat each item as an outcome in a three-level model.
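A sketch of that three-level structure (items nested within target ratings nested within participants) in lme4, with placeholder column names and a long-format item-level data frame.

```r
library(lme4)

# Sketch: willingness-to-help items as level-1 outcomes, nested in target
# ratings (level 2) nested in participants (level 3). All names are placeholders.
m3 <- lmer(help_item ~ shared_fate + empathetic_engagement +
             (1 | participant_id) + (1 | participant_id:target),
           data = dat_items)
summary(m3)
```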
Common Method Variance. Both the predictors and outcomes are self-report measures. A much riskier test of the PFI's convergent validity would involve using behavioral/peer-reported outcomes. For now, a skeptic could attribute convergent validity to response styles.
Latent Correlations. The relationship between shared fate and empathetic engagement was left untheorized; instead, there was a freely estimated non-directional correlation. Should it not be that shared fate is CAUSING empathetic engagement? It seems less likely that the causal arrow predominantly goes from empathetic engagement to shared fate, or that a common cause explains the factors' high correlation. If the association is not due to a common cause, then that causal arrow is important to specify, especially when these measures are integrated into more complex models where getting the causal ordering right will be crucial to estimating relationships among variables correctly. If shared fate causes empathetic engagement but they are, for instance, entered as mediators at the same stage of the model, shared fate will look less important than empathetic engagement even though it is a primary cause of empathetic engagement. I suspect this very situation could be going on in the supplementary regression where shared fate and empathetic emotions are both entered as simultaneous predictors.
Minor Issues
Review 2 (Columbus)
I have read the manuscript, "A new measure of perceived fitness interdependence: factor structure and validity", with great interest. I share the authors' conviction that interdependence is a profoundly important concept, and that fitness interdependence in particular has often been overlooked. Likewise, perceptions of interdependence are important and can drive people's behaviour, so the development of a measure of perceived (fitness) interdependence is a service to the field. Unfortunately, I have rather severe doubts about the validity of this measure in its current form. This is even more important here than in most other research because scales, once published, are regularly taken as valid and rarely questioned. I detail my concerns below and hope that the authors will read them as critiques from a researcher who would be excited to use a well-validated measure of perceived interdependence.
Major concerns
Conceptualisation. It is not clear what the scale is supposed to measure. The authors clearly seek to measure perceptions of some actual property of relations between individuals. Indeed, the inclusion of the term 'fitness' suggests that this actual property is expressed in terms of (inclusive) fitness. However, it is not clear what this property is. The obvious starting point is the definition of Aktipis et al. (2018) stated at the beginning of the paper. However, this definition merely states the notion of fitness interdependence verbally and does not formalise it (though Aktipis et al., 2018, draw on Roberts, 2005, who does).
Formalisation in some form or other would clarify what the scale does and does not seek to measure. It seems important to clarify how the scale, at least in principle, relates to theoretical accounts of interdependence as put forward by Roberts (2005) and Kelley et al. (2003). For example, I was not sure whether the scale aligns with the Interdependence Theory dimensions of the degree of interdependence (also called mutual dependence) or correspondence of interests (also called conflict of interests). This is, however, important when generating predictions to be tested using the scale.
Relatedly, I would have expected evidence that the scale aligns with the theoretical perspective within which it is embedded. In scale development, it is common to have theoretically-derived items rated by domain experts to assure that the items align with the theoretical account. This would be appropriate here as well.
Content and convergent validity. It is not clear what the scale actually measures. The authors identify two factors in their data, shared fate and empathetic engagement. These two factors were apparently not theoretically derived, in contrast to the overall aim of scale development. I will first address conceptual concerns about these factors, then highlight some empirical issues.
From my reading of the items, shared fate captures the covariation of people's outcomes. For example, by the notion of shared fate, two farmers working under the same local conditions may truthfully say that "what is good for X is good for me" with respect to the weather without having any influence on each other's outcomes. This stands in strong contrast to the notion of outcome interdependence in Interdependence Theory, which refers to mutual control over one's own and each other's outcomes. It also does not capture the definition provided by Aktipis et al. (2018), who define fitness interdependence as "the degree to which two or more organisms influence each other's success in replicating their genes." The formal definition by Roberts (2005) similarly refers to the consequences of acts for self and other. Thus, shared fate does not appear to capture any of the major theoretical notions of (fitness) interdependence in the literature.
The second factor is empathetic engagement, i.e., the degree of covariation between one person's emotions and another's outcomes. This is certainly interesting (and appears closely related to work on emotional interdependence, e.g., Sels et al., in press). However, it also clearly is not fitness interdependence. Again, it is easy to construct examples that make this obvious. E.g., people may be happy about the success of their favourite sports team, but clearly this does not affect their own fitness. Of course, one might make an argument that empathetic engagement serves as a useful proxy for fitness interdependence, but this argument is not stated.
Beyond theoretical concerns, there is also the issue that, empirically, there is little to go on to judge what either dimension actually measures. Both dimensions show medium to high correlations with inclusion of other in the self and closeness. However, these are not stated in terms of interdependence, so they cannot truly serve as criteria. Indeed, conceptually, fitness interdependence should be distinct from these two constructs. It may be that the high correlation is due to confounding in the prompts: Participants only rated targets (family members, acquaintances) for which closeness and fitness interdependence are likely correlated. It may be useful to consider circumstances under which these diverge (e.g., work contexts).
The authors did include one other measure of perceived interdependence, Brown's (1999) mutualism scale. This scale, however, again highlights the problem of theoretical alignment. Items from this scale include "I need [target] as much as they need me." In terms of Interdependence Theory, this clearly captures relative power (and would indeed be true even if both individuals were completely independent of each other). The items of the shared fate subscale, in contrast, align with a mix of the mutual dependence and conflict of interests dimensions of Interdependence Theory. There exists a multidimensional measure of perceived situational interdependence along these three dimensions (Gerpott et al., 2018). Examining how a measure of perceived (relationship-level) fitness interdependence aligns with this (situation-level) measure of outcome interdependence could clarify which aspects of interdependence are captured by the scale.
The only actual index of fitness interdependence included is genetic relatedness, which is only weakly correlated with perceived fitness interdependence. This should, in fact, cast doubt on the validity of the fitness interdependence scale. With respect to the prediction of behaviours, the authors refer to a "kinship premium" (p. 13). However, in this case, this is not a premium: it is a component of fitness interdependence which people can clearly report (as they can report relatedness), but which is not captured by the scale. Beyond relatedness, there is no evidence that perceived fitness interdependence tracks any sort of valid criterion. Given that the authors suggest wide-spread applications of their scale, including in group contexts, criterion validity is absolutely necessary before publication of the scale.
Discriminant validity. It is not clear what the scale does not measure. In scale development, it is important not just to show convergent validity, but also discriminant validity. Yet, neither study includes any measures intended to assess what the PFI scale does not capture. Given the crowded field of related measures (as evidenced by the high intercorrelations among the included measures), perceived fitness interdependence should be theoretically and empirically distinguished from them.
One area where this particularly worries me is the application to intergroup relations. The authors write that the scale "could also be used for assessing perceived interdependence with groups as well as civic and national communities." No evidence is provided that actually supports this application. Indeed, the PFI scale could easily capture not perceived interdependence, but social identification (one could argue that in an intergroup context, empathetic engagement is more aligned with a social identity perspective than with an interdependence perspective, given the focus on emotional outcomes). Thus, for such an application, discriminant validity with respect to social identification would be fundamental. In the absence of such evidence, applications of the PFI to intergroup relations are bound to lead to erroneous conclusions.
Predictive validity. It is not clear what the scale predicts. The study includes three outcome measures to assess predictive validity: welfare tradeoff ratio, help in need, and helping without reciprocation. All are themselves survey measures. Here, some actual behavioural criterion would help to dispel concerns about common method bias (though this is less of an issue for the WTR measure) and about ecological validity. Overall, correlations among survey measures provide little evidence of predictive validity.
Second, Table S2 shows that any predictive power of the PFI scale comes from the empathetic engagement subscale. The shared fate subscale has little to no incremental validity. Again, this suggests that the PFI scale, or at least one of its subscales, does not capture any distinct construct.
Scale construction. It is not clear how the scale was developed. There is no mention of how items were developed. Although the authors mention that they conducted two pilots, results from these pilots are not described. Reporting both would be good practice in scale development, where the writing and selection of items is a core part of the research process. In this particular case, I was surprised to see no mention of item selection. In theory-driven scale development, I would typically expect selection of items from a larger item pool based on theoretical (e.g., expert judgement) and empirical (e.g., EFA) grounds.
Minor issues
References
Aktipis, A., Cronk, L., Alcock, J., Ayers, J. D., Baciu, C., Balliet, D., ... & Sullivan, D. (2018). Understanding cooperation through fitness interdependence. Nature Human Behaviour, 2(7), 429–431.
Bollen, K.A. (2019). Model Implied Instrumental Variables (MIIVs): An alternative orientation to structural equation modeling. Multivariate Behavioral Research, 54(1), 31-46, http://doi.org/10.1080/00273171.2018.1483224.
Boudreaux, M. J., & Ozer, D. J. (2013). Goal conflict, goal striving, and psychological well-being. Motivation and Emotion, 37(3), 433-443.
Brown, S. L. (1999). Evolutionary origins of investment: Testing a theory of close relationships (Doctoral dissertation, Arizona State University).
Crutzen, R., & Peters, G. J. Y. (2017). Scale quality: alpha is an inadequate estimate and factor-analytic evidence is needed first of all. Health Psychology Review, 11(3), 242-247. https://doi.org/10.1080/17437199.2015.1124240
Fisher, Z. F., Bollen, K. A., Gates, K. M., & Rönkkö, M. (2017). MIIVsem: Model implied instrumental variable (MIIV) estimation of structural equation models. R Package Version 0.5.2.
Gerpott, F. H., Balliet, D., Columbus, S., Molho, C., & de Vries, R. E. (2018). How do people think about interdependence? A multidimensional model of subjective outcome interdependence. Journal of Personality and Social Psychology, 115, 716–742.
Greiff, S. & Heene, M. (2017). Why psychological assessment needs to start worrying about model fit. European Journal of Psychological Assessment, 33, 313-317. https://doi.org/10.1027/1015-5759/a000450
Hayduk, L. A. (2016). Improving measurement-invariance assessments: correcting entrenched testing deficiencies. BMC medical research methodology, 16(1), 130.
Kelley, H. H., Holmes, J. G., Kerr, N. L., Reis, H. T., Rusbult, C. E., & van Lange, P. A. M. (2003). An Atlas of Interpersonal Situations. Cambridge, UK: Cambridge University Press. https://doi.org/10.1017/CBO9780511499845
Rabbie, J. M., & Horwitz, M. (1988). Categories versus groups as explanatory concepts in intergroup relations. European Journal of Social Psychology, 18, 117-123.
Roberts, G. (2005). Cooperation through interdependence. Animal Behaviour, 70, 901–908.
Ropovik, I. (2015). A cautionary note on testing latent variable models. Frontiers in Psychology, 6, 1715.
Sels, L., Cabrieto, J., Butler, E., Reis, H., Ceulemans, E., & Kuppens, P. (in press). The occurrence and correlates of emotional interdependence in romantic relationships. Journal of Personality and Social Psychology. https://doi.org/10.1037/pspi0000212
Sznycer, D., Ayers, J. D., Sullivan, D., Beltran, D. G., van den Akker, O. R., Gervais, M. M., … Aktipis, A. (2020, May 3). A new measure of perceived fitness interdependence: Factor structure and validity. https://doi.org/10.31234/osf.io/7yzhd
Wolf, E. J., Harrington, K. M., Clark, S. L., & Miller, M. W. (2013). Sample size requirements for structural equation models: An evaluation of power, bias, and solution propriety. Educational and Psychological Measurement, 73(6), 913-934.