PREreview of Small area estimation of forest biomass via a two-stage model for continuous zero-inflated data

Published: March 5, 2024
DOI: 10.5281/zenodo.10783462
License: CC BY 4.0

Summary

This study evaluates a two-stage model-based small area estimator that incorporates a zero-inflation model. The estimator is applied to aboveground biomass in Nevada in two settings: simulated small samples drawn by subsampling the 10-year FIA data, and a real small-sample problem that uses only the 2019 FIA data. The authors compare their model to the standard post-stratified estimator used by FIA and to unit-level and area-level small area estimators. Their question is whether the zero-inflation model yields higher precision and lower bias in small area estimation. They answer it by fitting each model to the Nevada FIA data and comparing bias, precision, and relative efficiency. The authors conclude that the two-stage model performs better than the other tested models in this region, which has large areas of non-forest. The results support this conclusion, showing that the zero-inflated model has higher precision, greater relative efficiency, and lower bias. The authors acknowledge that the model may not perform as well in regions with a large amount of forest land.

Small area estimators are an active area of research in forest structure estimation, in part because of the growing importance of carbon accounting and in part because of the rise of remote sensing systems that can be used to improve the precision of estimators. Others have made similar comparisons between different types of estimators, but to date there have been few evaluations of this zero-inflated model. Such evaluations matter because zero-inflation is a common issue in forest structure estimation, particularly in areas with little forest. As someone who works in forest structure estimation of woodlands and savannas, I found the description and test of the model particularly interesting and would consider using it for evaluating forest structure in other woodlands.

This manuscript could be improved by introducing the software package in a separate short publication rather than as a single section of the discussion. My primary recommendation is to expand the presentation of the results and to elaborate on their meaning and importance in the discussion.

Major comments

Consider publishing the R package as a separate publication instead of introducing it along with the analysis where it is first used here. This is just a recommendation, but I think keeping the two main objectives of 1) introducing a new R package and 2) evaluating a zero-inflated model in separate papers would make for a clearer presentation of both. Barring that option, consider instead introducing the package in the methods. The discussion could still include a paragraph touting the benefits of the package as allowing for the wider adoption of zero-inflation models.
The discussion appears to be incomplete. So far it only presents the R package and does not discuss the results or put them in the context of previous research. Comparisons could be made between the results here and other studies that compare the precision of different small area estimators. The discussion could also comment on the potential for bias in model-based estimators. The need to carefully check model specification and confirm that a model is unbiased is a major shortcoming of any model-based estimator, and it is why FIA (and the IPCC) prefer design-based estimators.

Minor Comments

Title and Abstract

The title is good and accurately reflects the contents of the manuscript. If this manuscript is the only one to introduce the software that is developed, then I would also include the name of the software so that others can more easily find this manuscript to cite: Small area estimation of forest biomass via a two-stage model for continuous zero-inflated data with the R package saeczi

However, I feel that the software would be better introduced in a separate publication even if it’s only a very short one.

In the abstract, the research question is implied by the statement of “compare the performance of this estimator…” The research question could be stated explicitly and built up as a focus on developing small area estimators for regions that are zero-inflated (i.e., don’t have a lot of forest land).
In the abstract, the second analysis of the 2019 data is only briefly mentioned, and no results are given. This could be expanded upon briefly since it is also an important set of results.

Introduction

For the most part, the manuscript provides references to appropriate research. However, there are some statements that would benefit from support with a citation. Examples below with page number and indicator of where:
- P4: “Horvitz-Thompson estimator within strata” - cite the method used since this is the standard being tested against. I believe Bechtold and Patterson 2005 would work here, but there may be a better citation.
- P6 “sometimes lead researchers to …” – Are there citable examples of this?
- P6 “…often stronger and more linear at the area level.” - I agree that this is generally true (in part because of the reduction of variance when aggregated to larger areas), but it would be good to support this statement with a citation.
P4-5 “model-assisted estimators…still show a lack of sufficient precision” - There are other instances in the remote sensing literature of model-assisted estimators yielding good precision. See the review by Stahl et al., 2016 for examples (https://doi.org/10.1186/s40663-016-0064-9). You may want to temper or modify this statement slightly to acknowledge these prior findings.
P5 “This model misspecification…” - It would be good to elaborate on how modeling assumptions are often violated. For instance, if the model depends on normality and zero-inflation violates that assumption, this could be stated explicitly. Maybe give a little more detail of the findings in Frescino 2022 mentioned in the previous sentence so the reader does not have to look up this paper. Also, I wouldn't say this is a requirement, but this could be a good point to bring up the option of non-parametric models in model-based estimation, which have less strict assumptions and may be able to deal with zero-inflation more naturally.
P6 “at sampled plot locations” - What about non-sampled and non-forest plots? I think this would be a larger source of zeros.
In the final/objectives paragraph of the introduction, the research questions and objectives are not stated explicitly but implied. The final paragraph of the introduction instead summarizes what was done in each section of the paper. If the length of the methods warrants this summary of steps, then this may be best moved to an overview/intro section of the methods. The final paragraph of the introduction could then instead focus on what the research question was, what the objectives were, if there were any hypotheses, and a very basic (1-2 sentence) overview of how the questions/objectives were addressed.

Methods

This is a great analysis, and I appreciate the clear and simple description of the models and notations used. This aspect is often difficult to follow in other papers on small area estimation, particularly for readers who are new to this area of statistics.
In the data section, please include a paragraph that briefly describes the auxiliary variables listed in table 1.
P8 – “data were extracted and matched” – Were true or fuzzed plot locations used?
P8 – “observations across the 17 counties in Nevada…” – Please state the time span (and perhaps the EVAL ID) of the data used somewhere around here in the text.
P8 – “measurement year 2019” - FIA collects data in panels associated with the inventory year (invyr), which can be treated as an equal probability sample representative of the state. But not all plots of the panel are measured in the invyr, and a measurement year (measyear) may contain data from other panels. Can you check and report that your data is from a single panel and can in fact be treated as an equal-probability sample? Also check that only single intensity plots (intensity==1) are used because some areas like national forests are sampled at higher intensity (i.e., more samples), which would bias estimates of areas that include both single and higher intensity sampling.
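As a rough illustration of the check I have in mind for the comment above, the sketch below filters plot records to one measurement year and single-intensity plots, then counts how many inventory years (panels) remain. The field names INVYR, MEASYEAR, and INTENSITY follow my reading of the FIA plot table and should be verified against the actual data; the records and the helper name check_single_panel are made up for illustration.

```python
from collections import Counter


def check_single_panel(plots, measyear):
    """Keep single-intensity plots measured in `measyear` and count their panels.

    `plots` is a list of dicts with (assumed) keys INVYR, MEASYEAR, INTENSITY.
    Returns the selected plots and a Counter of INVYR values among them.
    """
    selected = [
        p for p in plots
        if p["MEASYEAR"] == measyear and str(p["INTENSITY"]) == "1"
    ]
    panels = Counter(p["INVYR"] for p in selected)
    return selected, panels


# Hypothetical records, not real FIA data.
plots = [
    {"PLOT": 1, "INVYR": 2019, "MEASYEAR": 2019, "INTENSITY": 1},
    {"PLOT": 2, "INVYR": 2018, "MEASYEAR": 2019, "INTENSITY": 1},
    {"PLOT": 3, "INVYR": 2019, "MEASYEAR": 2019, "INTENSITY": 2},
    {"PLOT": 4, "INVYR": 2019, "MEASYEAR": 2020, "INTENSITY": 1},
]

selected, panels = check_single_panel(plots, 2019)
if len(panels) > 1:
    print("More than one panel in this measurement year:", dict(panels))
```

If more than one inventory year appears, the 2019 measurement-year sample mixes panels and the equal-probability assumption would need to be justified or the selection changed to use INVYR.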
P11 – “categorical auxiliary variable that classifies…” - Can you describe this variable with more detail? What is the source of the forest/non-forest map and how is it made? Also, I believe the FIA uses the NLCD Tree canopy cover product to create multiple strata for estimation. Please consider using this approach instead to better represent the FIA standard that the zero-inflated model is being compared to.
P14 – “obtained from” – delete duplicate use
Section 2.3.2 Aims first paragraph – this paragraph would be best suited to the objectives paragraph of the introduction. At this point the objectives of this analysis should already be clear, and how they are done should be described instead.

Results

Section 3.1 – The results here should probably be described in greater detail, as I explain in the following comments. Each paragraph appears to be a start at describing a figure, but numeric details and perhaps some additional interpretation of the results are still needed. For example, when saying one model has more variation in PRB or Pct Relative RMSE, the standard deviations of these values could be given to provide detail on what is seen in the figures. If that additional detail is not added, the next four very short paragraphs could instead be condensed into a single paragraph that simply describes the results rather than introducing each as "we examine..." or "next, we turn to...".
P20 “have more variation” - This is a good thing to point out, but you could give numeric details on the difference in variation between the models (e.g., the standard deviation of PRB for each model, or at least for the models with the greatest and least variation in PRB). This is one example of providing more detail in the results.
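To make the kind of numeric summary suggested above concrete, here is a minimal sketch. It assumes PRB is defined as 100 × (mean estimate − true value) / true value for each county, which may differ from the manuscript's definition, and it uses invented numbers and hypothetical function names.

```python
from statistics import mean, stdev


def percent_relative_bias(estimates, truth):
    """PRB for one county: 100 * (mean of simulated estimates - truth) / truth."""
    return 100.0 * (mean(estimates) - truth) / truth


def prb_spread(prb_by_county):
    """Sample standard deviation of PRB across counties for one estimator."""
    return stdev(prb_by_county)


# Invented example: three simulated estimates for one county whose true value is 10.
prb_one_county = percent_relative_bias([9.0, 11.0, 13.0], 10.0)

# Invented PRB values for three counties under one estimator.
spread = prb_spread([-2.0, 0.0, 2.0])
```

Reporting one such standard deviation per estimator alongside the figure would let readers compare the variation directly.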
It would be good to include the specific results of the underlying models (e.g., the coefficients of the unit-level and zero-inflated models when using all data). These model details are not central to the point and conclusions of the study but readers more focused on the remote sensing predictors and model types may find them useful. They would be standard information to include in a remote sensing journal.
Section 3.2 - The next two paragraphs should be a new subsection of the methods rather than a part of the results. Also, say what 10-year period is used.
P21 “estimated relative efficiency” - I think relative efficiency usually puts the basis of comparison in the numerator so that a value > 1 indicates a higher efficiency (improvement) with the test method (zero-inflation model). I personally find this more intuitive (i.e. value >1 indicates test method is preferable), but either way works. I have also seen others translate relative efficiency to an equivalent number of plots needed to achieve the same variance. This is another approach that you may like to consider to supplement the RE values.
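The convention described above, along with the equivalent-number-of-plots translation, can be sketched as follows. The variances and plot count are made-up numbers, the function names are hypothetical, and the conversion assumes the reference estimator's variance shrinks in proportion to 1/n, which is a simplification on my part.

```python
def relative_efficiency(var_reference, var_test):
    """Reference variance over test variance: > 1 means the test estimator is more precise."""
    if var_reference < 0 or var_test <= 0:
        raise ValueError("variances must be non-negative, and var_test must be positive")
    return var_reference / var_test


def equivalent_plots(n_plots, re):
    """Plots the reference estimator would need to match the test estimator's variance.

    Assumes the reference variance is proportional to 1 / n.
    """
    return n_plots * re


# Made-up values: post-stratified variance 40, zero-inflated model variance 10, 25 plots.
re = relative_efficiency(40.0, 10.0)
n_equiv = equivalent_plots(25, re)
```

With these invented inputs the reference estimator would need four times as many plots to reach the same variance, which many readers will find easier to interpret than a ratio alone.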
Figure 2 – Why do all the estimators underestimate the “true” RMSE?

Discussion & Conclusion

I have few minor comments here because the discussion is primarily a short summary of the R package, but I think it really should expand more on the meaning and importance of the results.
P24 “While we do not expect….” – This is redundant with the statement in the previous paragraph.

Appendix

P39 “making it impossible to perform the back-transformation” - Please move this statement to the methods, since the lack of a back-transformation is a likely criticism from many readers. This issue occurred to me early on, and I expect most readers would feel the same if they did not see this justification.
