Cumulative Effects of Uncertainty on Simulated Streamflow in a Hydrologic Modeling Environment

It is common in the literature not to consider all sources of uncertainty simultaneously: input, structural, parameter, and observed calibration data uncertainty. This is particularly true in data-sparse environments, where data limitations, and the complexities they create when propagating uncertainty downstream through a modelling chain, constrain what can be quantified. This paper presents results for the propagation of multiple sources of uncertainty toward the estimation of streamflow uncertainty in a data-sparse environment. Uncertainty sources are treated separately to ensure low likelihood uncertainty distribution tails are not rejected, allowing the interaction of uncertainty sources to be examined. Three daily resolution hydrologic models (HYPE, WATFLOOD, and HEC-HMS), forced with three precipitation ensemble realizations generated from five gridded climate datasets for the 1981–2010 period, were used to examine the effects of cumulative propagation of uncertainty in the Lower Nelson River Basin as part of the BaySys project. Selected behavioral models produced an average range of Kling-Gupta Efficiency scores of 0.68–0.79. Two alternative methods for behavioral model selection that ingest streamflow uncertainty were also considered. Structural and parameter uncertainty were, individually, found to be insufficient, producing some uncertainty envelopes narrower than observed streamflow uncertainty. Combined structural and parameter uncertainty, propagated to simulated streamflow, often enveloped nearly 100% of observed streamflow values; however, high and low flow years were generally a source of lower reliabilities in simulated results. Including all sources of uncertainty generated simulated uncertainty bounds that enveloped most of the observed flow uncertainty bounds, including improvement for high and low flow years across all gauges, although the uncertainty bounds generated were of low likelihood.
Overall, accounting for each source of uncertainty added value to the simulated uncertainty bounds when compared to hydrometric uncertainty; the inclusion of hydrometric uncertainty was key for identifying the improvements to simulated ensembles.


Introduction
Hydrologic models are used to generate many different types of output, most frequently streamflow. Models make various simplifications when representing a target physical environment (e.g., Clark et al., 2015), and these simplifications, among other sources, introduce uncertainty to simulated output. Uncertainty is defined as the realistic range within which an exact value for a given variable cannot be determined but can be represented by a likelihood (Uusitalo et al., 2015). Uncertainty is generated by a model's structure (e.g., Clark et al., 2015) and its parameters (e.g., Shafii et al., 2015). Uncertainty is also ingested through model input (e.g., Pokorny, 2019) and observational data used for calibration, hereafter referred to as hydrometric uncertainty (e.g., McMillan et al., 2010). Uncertainties interact within a hydrologic modeling framework by propagation, the process by which uncertainty from one modeling step affects the next, moving cumulatively downstream (e.g., Brown and Heuvelink, 2006; Ajami et al., 2007; Nikolopoulos et al., 2010; Mei et al., 2016). All methods for quantifying uncertainty are limited by subjectivity (Kavetski et al., 2003; Kavetski et al., 2006; Beven and Binley, 2014). Examination of the subjectivity in uncertainty estimation methods is a common topic that has driven the development of the many uncertainty estimation frameworks available in the literature (Matott et al., 2009).

Input uncertainty includes a wide range of data sources. Many studies choose to focus on precipitation as a dominant source of input data uncertainty (e.g., Kavetski et al., 2006; Ajami et al., 2007). Some studies consider meteorological data errors and uncertainty toward suggesting a preferred product (e.g., Choi et al., 2009; Eum et al., 2014; Wong et al., 2017), while others suggest meteorological input ensembles (e.g., Montanari and Di Baldassarre, 2013; Rapaić et al., 2015; Gbambie et al., 2017; Lilhare et al., 2019). Meteorologic input ensembles are preferred as they better account for uncertainty than a single realization (Kavetski et al., 2006; Ajami et al., 2007). Meteorologic ensembles are common in hydrologic climate change studies (e.g., Dams et al., 2015; Karlsson et al., 2016), but less so in historical hydrologic studies (e.g., Nikolopoulos et al., 2010).
Model structural uncertainty represents the uncertainty generated by model spatial and temporal aggregation (Dams et al., 2015; Wi et al., 2015; Muhammad et al., 2018a), numerical representation of hydrologic processes (Tasdighi et al., 2018), and hydrologic connectivity, which considers internal model decision-making for the allocation of water and the order of calculation (e.g., Clark et al., 2015). Structural uncertainty is often accounted for by multi-model studies (e.g., Ajami et al., 2007) or modular modeling (Clark et al., 2015). Parameter uncertainty is inherently tied to structural uncertainty because the number and influence of parameters are dictated by the numerical process representation used in a hydrologic model. Parameter uncertainty is often presented with parameter likelihood distributions described by the Generalized Likelihood Uncertainty Estimation (GLUE) methodology (Beven and Binley, 1992). The GLUE methodology has been the subject of much discussion due to the use of both formal and informal likelihood functions (Beven and Binley, 2014). A statistical inference may be ill-posed due to misclassifications of uncertainty or limited representation of uncertainty (Beven, 2016). The limits of acceptability (LOA; Beven, 2006) methodology addresses some of these shortcomings; however, subjectivity remains (e.g., Li et al., 2010; Li and Xu, 2014; Shafii et al., 2015), which may be further complicated by parameter identifiability (Wagener et al., 2003; Abebe et al., 2010; Merz et al., 2011). An informal likelihood function with GLUE may still offer important insights when assumptions for formal inference are not satisfied (Beven and Binley, 2014). The informal objective function and model selection method will influence the selected parameter sets (e.g., Shafii et al., 2015).
Hydrometric uncertainty, like input uncertainty, is ingested by a hydrologic model. Hydrometric uncertainty is a function of multiple sources, such as gauge accuracy, rating curve fit, and many others (Shiklomanov et al., 2006; Hamilton, 2008; Hamilton and Moore, 2012; McMillan et al., 2012; Westerberg et al., 2016; Whitfield and Pomeroy, 2017; McMillan et al., 2018). Methods to quantify hydrometric data uncertainty exist and continue to be developed in recent literature, focusing on rating curve based uncertainty estimates (e.g., Coxon et al., 2015; Kiang et al., 2018). Hydrometric uncertainty is not accounted for in traditional calibration metrics (e.g., Gupta et al., 2009; Westerberg et al., 2020). The LOA methodology can include hydrometric uncertainty but relies on subjective decision thresholds for model rejection. Reliability and sharpness offer an alternative objective function to optimize for simulation acceptance and rejection (Li et al., 2010; Westerberg et al., 2011; Li et al., 2014; Bourgin et al., 2015; Shafii et al., 2015; Zhou et al., 2016). Reliability is a measure of overlap with an observed benchmark (Montanari, 2005), and sharpness is a measure of uncertainty bound width (Yadav et al., 2007).
Studies often find input or structural uncertainty to be the largest contributors to total uncertainty (e.g., Ajami et al., 2007; Chen et al., 2011; Dams et al., 2015). No study has exhaustively sampled all possible sources of uncertainty due to dimensionality issues, data limitations, and computational limitations (Ajami et al., 2007). Therefore, total uncertainty is defined as the accumulation of all sources of uncertainty considered in a study; this definition is used in this study as well. Separating the contribution of each source of uncertainty is difficult due to the complex interactions among sources; sampling techniques are generally applied to consider each source of uncertainty after propagation to output (Pechlivanidis et al., 2011). For instance, models like WATFLOOD (Kouwen, 2018) use grouped response units, an approach that ties parameter values (parameter uncertainty) to landcover (input uncertainty) (e.g., Eckhardt et al., 2003). Pokorny (2019) shows that precipitation data uncertainty (input uncertainty) is tied to spatial aggregation, which is also a function of model structure (Wi et al., 2015; Muhammad et al., 2018a). Bayesian inference methods are often applied to generate uncertainty distributions with formal likelihood functions (Stedinger et al., 2008; Renard et al., 2010). Formal methods, however, often present poorly constrained problems arising from various epistemic sources of uncertainty being simplified to stationary aleatory errors (Beven, 2016). Model output is also subjected to the propagation of all sources of uncertainty, which suggests uncertainty distributions are relative rather than absolute (Beven and Binley, 2014). The information gained by generating relative uncertainty distributions still informs the relative importance of sampling a particular source of uncertainty (e.g., Kavetski et al., 2003; Kavetski et al., 2006; Renard et al., 2010; Dams et al., 2015).
Indeed, the relative separation of uncertainty sources has been widely studied in the literature (e.g., Kavetski et al., 2003; Vrugt et al., 2005; Kavetski et al., 2006; Clark et al., 2011; Clark et al., 2015). Comparisons of uncertainty sources and their cumulative effects on propagated uncertainty as they compound to total modeling uncertainty, however, remain rarely studied.
Few studies examine the cumulative impacts of each source of uncertainty as they propagate to model output (e.g., Ajami et al., 2007). This is especially true in large, northern basins in Canada, where data sparsity compounds the difficulties associated with generating relative uncertainty distributions. The consideration of low likelihood uncertainty distribution tails is generally reserved for forecasting-based studies, but they also become more important when data sparsity is an issue (e.g., Demeritt et al., 2007; Cloke and Pappenberger, 2009; Addor et al., 2011; Pappenberger et al., 2013). For example, rating curve uncertainty estimates for Canadian hydrometric data are not currently available due to data limitations (Hamilton and Moore, 2012). The Water Survey of Canada suggests ±5% uncertainty (Environment Canada, 1980); however, the literature suggests more complexity, with uncertainty estimates up to ±10% (Dingman, 2015) or higher (e.g., McMillan et al., 2010; McMillan et al., 2012). Wide bounds of uncertainty are often criticized for lacking value because the likelihood of those bounds is low; hence, communicating the value of wide uncertainty bounds is a common issue (Pappenberger et al., 2013).
To address the aforementioned gaps in the literature, the objectives of this study are to 1) select a general method appropriate for characterizing uncertainty propagation in remote, data-sparse regions, 2) apply this method to generate relative uncertainty distributions for multiple sources of hydrologic model uncertainty, and 3) examine the cumulative effects of propagated sources of uncertainty on simulated flow uncertainty, including low likelihood uncertainty distribution tails. This study is part of the Hudson Bay System (BaySys) project (see Barber, 2014). The BaySys project is a multi-disciplinary regulation impact study in which the effects of climate change and hydropower regulation are relatively partitioned for comparison. To generate large-scale relative partitions of hydropower regulation and climate change impacts, the results from numerical models (e.g., hydrologic models) are used as inputs to other numerical models (e.g., ocean models). With model output being ingested by other models as input, understanding the uncertainty generated by those models is important for determining the value of final results. The region of study used to address the study objectives is the Lower Nelson River Basin (LNRB), which is further described in the following sections. The LNRB is highly relevant to the BaySys project, as it is subject to considerable regulation and suffers from data sparsity.

Study area
The region of study is the Lower Nelson River Basin (LNRB) in Northern Manitoba, Canada (Figure 1). The LNRB was selected for its BaySys project relevance and for its relative data sparsity (Barber, 2014); beyond the BaySys consideration, any data-sparse region could have been considered. The LNRB is a basin of ~90,500 km² at the downstream end of the ~1,400,000 km² Nelson Churchill Watershed (NCW) (Figure 1). The LNRB is characterized by a sub-arctic climate and low-relief terrain largely covered by forest and wetlands, with many lakes due to the low topographic relief. The LNRB is a data-sparse basin with limited climate stations. January is the coldest month, with 30-year average temperatures (1981-2010) of -23.9°C and -24.4°C in the towns of Thompson and Gillam, respectively. Summers are cool, with July being the warmest month, with temperatures of 16.2°C and 15.8°C at Thompson and Gillam, respectively. Precipitation (1981-2010) is highest in July at 80.9 mm and 78.6 mm for Thompson and Gillam, respectively. Total annual precipitation for Thompson and Gillam is 509.0 mm and 496.4 mm, respectively, for the 1981-2010 period, of which 43% occurred in summer (JJA). Hydrograph peak flows are dominated by spring melt runoff. Spring peaks in unregulated basins generally occur in May, except for the Grass River, in which peak flows are usually attenuated to July or later by lakes and wetlands. The LNRB, and its upstream contributing area, have been included in numerous climate change studies, which suggest hydrologic vulnerability to climate change. Some notable studies looked at shifts in river ice jams (e.g., Rokaya et al., 2018), changes in water volume originating from the prairies (e.g., Muhammad et al., 2018a), and hydrologic regime trends (e.g., Westmacott et al., 1997).

Hydrometric data
Hydrometric gauges are shown in Figure 1 and detailed in Table 1. Water from Lake Winnipeg enters the LNRB via the Nelson River through the Jenpeg Generating Station and the East Channel and is augmented by flow diverted from the Churchill River (Notigi control structure) via the Burntwood River. There are six Manitoba Hydro operated generating stations within the LNRB: the Jenpeg, Kelsey, Kettle, Long Spruce, Limestone, and Wuskwatim generating stations act as points of regulation. Additionally, the Keeyask Generating Station is currently under construction. Manitoba Hydro operates the Wuskwatim generating station on behalf of the Wuskwatim Power Limited Partnership, of which Manitoba Hydro is a partner, and will operate the Keeyask generating station as a partner. Publicly available WSC data were supplemented with internal Manitoba Hydro records. Missing data were infilled using linear interpolation for short gaps of two or fewer days. Longer data gaps were not infilled; performance metrics were calculated based on available data.
Since the LNRB is a downstream basin in the NCW, the flow from upstream sources must be added to the basin and considered as an input. The flow records from the Jenpeg generating station, East Channel, and the Churchill River Diversion (Notigi Control Structure) are, therefore, added as model forcings to the LNRB hydrologic models. The flows through Jenpeg and the Churchill River Diversion are measured reasonably accurately because control structures provide stable conditions for flow measurement. These inflows are closely monitored and regulated, and the control structures are fixed with known dimensions, which allows measurement uncertainty to approach gauge accuracy. With such ideal streamflow estimation conditions, uncertainty can reasonably be assumed negligible. The East Channel is subject to the same uncertainty as any other non-regulated gauge; however, it represents only a small portion of the forced inflows. Therefore, while nudged inflows are subject to minor uncertainty, it was assumed negligible, which also reduced computational time for model runs.

Gridded climate data ensemble
Meteorological input data for this study consist of daily timeseries of precipitation and air temperature (1981-2010). The ensemble consists of five gridded datasets, presented in Table 2, which are suggested in the literature to be reasonable representations of observed conditions (Lilhare et al., 2019; Pokorny, 2019).
All gridded datasets were bilinearly interpolated to match ANUSPLIN's 0.10° grid (Lilhare et al., 2019). Daily time steps were used when available. For NARR, daily minimum and maximum air temperatures were estimated from 3-hourly data. Three precipitation realizations were generated: ensemble minimum, mean, and maximum. Minimum (maximum) precipitation was generated by selecting the minimum (maximum) value of all five gridded datasets for each grid cell at each daily time-step. This method is expected to generate an overestimate of uncertainty. Ideally, an estimate of total uncertainty would be available for every gridded meteorological dataset, but this is not computationally feasible. Recent studies suggest that the selected datasets suffer from notable uncertainty (Lilhare et al., 2019). Therefore, since the goal is to ensure low likelihood uncertainty distribution tails are not rejected, the wide range of precipitation uncertainty is assumed reasonable. A mean precipitation realization was generated by the arithmetic mean of all five datasets with equal weighting. Regarding other model inputs, only the arithmetic mean air temperature realization was considered, which reduced computational demands; temperature is measured with noticeably less uncertainty than precipitation (Pokorny, 2019), so temperature input uncertainty was not considered further. It is a standard assumption that precipitation data uncertainty is the major contributor to meteorological input uncertainty, which is built into most uncertainty frameworks (e.g., Kavetski et al., 2006; Huard and Mailhot, 2006; Ajami et al., 2007; Vrugt et al., 2008; Renard et al., 2010; McMillan et al., 2011).
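The realization construction described above can be sketched as follows; this is a minimal illustration assuming the five interpolated datasets are stacked into one array (names and shapes here are illustrative, not from the study's code):

```python
import numpy as np

def precipitation_realizations(datasets):
    """Build ensemble minimum, mean, and maximum precipitation realizations.

    datasets: array of shape (n_datasets, n_days, ny, nx) holding daily
    precipitation (mm) from each gridded product on the common 0.10 deg grid.
    """
    p_min = datasets.min(axis=0)    # cell-wise daily ensemble minimum
    p_mean = datasets.mean(axis=0)  # equally weighted arithmetic mean
    p_max = datasets.max(axis=0)    # cell-wise daily ensemble maximum
    return p_min, p_mean, p_max
```

Because the minimum and maximum are taken independently per cell and per day, the resulting realizations need not correspond to any single dataset, which is why they bound the ensemble and overestimate uncertainty.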

Hydrologic model ensemble
An ensemble of hydrologic models, all running on daily timesteps, was selected to account for structural uncertainty by varying both spatial aggregation and hydrologic process representation. The models selected to represent the LNRB included: the Hydrological Predictions for the Environment model (HYPE; SMHI, 2018), the WATFLOOD model (Kouwen, 2018), and the Hydrologic Engineering Center Hydrologic Modeling System (HEC-HMS; USACE, 2016) (Table 1). Existing models of the LNRB were used with some modifications to parameterizations. Differences in underlying data used to set up the models, such as land cover, are assumed to be minor and are accounted for with re-parameterizations (Eckhardt et al., 2003). The assumption that non-meteorological input selection (and differences) among the model setups introduces minor uncertainty has more of an impact for WATFLOOD and HYPE, since parameters in these models are tied to land and soil classification, but is less impactful for HEC-HMS. Assuming negligible temporal changes in land and soil classification is of concern when performing long-term simulation (e.g., 50 or 100 years), but has less of an impact over a shorter duration climatological period. Regardless, detailed land and soil classifications for the 1981-2010 period were not available for the LNRB (Dwarakish and Ganasri, 2015). Contributing areas upstream of hydrometric gauges were reviewed and adjusted to ensure equal comparisons between models. The HYPE and WATFLOOD models were set up for long periods of simulation (i.e., 30 years) (Stadnyk et al., Accepted; Holmes, 2016). The HYPE model was set up as part of the BaySys project for historic and future simulations. The WATFLOOD model was set up for isotope-based calibration. The version of the LNRB HEC-HMS model used, however, was set up for flood studies. Therefore, the HEC-HMS hydrologic processes were updated to use the Priestley-Taylor evapotranspiration method (Priestley and Taylor, 1972), which is expected to perform better than the fixed monthly evapotranspiration method used by Sagan (2017) for the 30-year simulation period used in this study (Table 3).
Table 1 notes: 1 These stations were not used due to close proximity to other stations; therefore, they did not provide additional information. 2 Station 05UB008 is a natural channel but is considered regulated by the WSC because its water source (Lake Winnipeg) is subject to regulation; it is therefore affected by regulation but not regulated itself.
The hydrologic models were run at a daily timestep for the 1981-2010 climatic period. Two additional years of model spinup were included (1979-1980).

Methodology
The methodology for this study is designed to retain low likelihood uncertainty distribution tails. Each source of uncertainty was considered separately to ensure lower likelihood simulations were not rejected.

Data preparation
Input datasets include the three precipitation realizations, the arithmetic mean temperature realization, and the unaltered observed streamflow records at nudging locations. HEC-HMS required daily dewpoint temperatures for the Priestley-Taylor method. Dewpoint temperatures were not available in all datasets, and a comparable mean product could not be generated for them. Therefore, dewpoint temperatures were generated from precipitation and temperature data so that the effect of using the minimum and maximum precipitation realizations was reflected in the estimate of dewpoint temperatures. Dewpoint temperature timeseries were estimated using Equation (1), developed by Hubbard et al. (2003): in which Td is the daily dewpoint temperature (°C); Tn is the daily minimum temperature (°C); Tx is the daily maximum temperature (°C); PDaily is the daily precipitation (mm); and α, β, γ, and λ are unitless regression constants fitted to ground-based climate station dewpoint temperature data in the LNRB by least squares regression. Input data for the HYPE model, recently set up in part by Stadnyk et al. (Accepted), were generated by assigning the nearest climate data grid points to sub-basins. The input data for the WATFLOOD LNRB model, recently set up in part by Holmes (2016), aligned with the 0.10° gridded ensemble data. The HEC-HMS LNRB model used input climate data generated by aggregating grid points within and along basin delineations.
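The fitting step can be illustrated with the sketch below. The exact functional form of Equation (1) is not reproduced here; the linear combination Td = α·Tn + β·Tx + γ·PDaily + λ is an assumed stand-in used only to show how the constants would be fitted to station data by least squares (all function names are hypothetical):

```python
import numpy as np

def fit_dewpoint_constants(t_min, t_max, precip, t_dew):
    # Least squares fit of the regression constants (alpha, beta, gamma,
    # lambda) to station dewpoint observations. The linear form assumed
    # here is illustrative only, not Equation (1) of Hubbard et al. (2003).
    X = np.column_stack([t_min, t_max, precip, np.ones_like(t_min)])
    coef, *_ = np.linalg.lstsq(X, t_dew, rcond=None)
    return coef  # (alpha, beta, gamma, lambda)

def estimate_dewpoint(coef, t_min, t_max, precip):
    # Apply the fitted constants to a daily Tn/Tx/P timeseries.
    X = np.column_stack([t_min, t_max, precip, np.ones_like(t_min)])
    return X @ coef
```

Because the predictors include the precipitation series, swapping in the minimum or maximum precipitation realization changes the estimated dewpoint series, which is the behavior the study relies on.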

Uncertainty analysis
Input uncertainty was addressed by the three precipitation input realizations; precipitation was the focus as its uncertainty is assumed higher than that of other model inputs (Wong et al., 2017). The precipitation ensemble minimum and ensemble maximum represent higher uncertainty than would be useful for design or operational work. The wide uncertainty bounds generated by the ensemble minimum and maximum represent low likelihood tails of the uncertainty distribution. These low likelihood distribution tails may increase in likelihood under climate change effects; the uncertain range is also similar to the uncertainty generated in medium-range ensemble flood forecasting (e.g., Han and Coulibaly, 2019). Different percentile ranges can be used to consider narrower ranges of uncertainty while retaining the ability to consider low likelihood events at higher percentile ranges (e.g., Pappenberger et al., 2013). Uncertainty introduced by model structure was addressed through the application of three hydrologic models of varied structures. For example, the Odei River basin was represented by multiple sub-basins in HYPE (semi-lumped), by 82 0.10° grid cells, including those downstream of the streamflow gauge, in WATFLOOD (semi-distributed), and by a single basin in HEC-HMS (lumped). The range of spatial structures among the models tested the effect of spatial aggregation on input uncertainty propagation. Additionally, each model used different methods to represent individual hydrologic processes, each having a different number and type of parameters (Table 1). This model structure sampling does not allow for detailed partitioning of the structural uncertainty introduced separately by spatial and temporal aggregation, selection of numerical processes, and hydrologic connectivity (internal flow paths).
Additional sampling would have been required, varying a single one of these sources with the others held constant, which is a sampling technique better suited to modular modeling (Clark et al., 2015). Instead, structural uncertainty is considered through the selection of three different model structures, which varies each type of structural uncertainty (aggregation, process, and connectivity) simultaneously, generating a single estimate of the overall structural uncertainty from the ensemble of models. This presents a potential limitation: using three established model structures does not ensure structural uncertainty extends to low likelihood distribution tails.
Model parameters are used to tune model performance to be representative of a target physical environment; however, they are subject to equifinality (Beven and Binley, 1992). Parameter uncertainty was addressed using the GLUE methodology (Beven and Binley, 1992; Beven and Binley, 2014); parameter sets were sampled using orthogonal Latin-hypercube sampling (OLHS; Tang, 1993) from uniform priors bounded by ranges suggested by the most recent modelers (Holmes, 2016; Sagan, 2017; Stadnyk et al., Accepted). A total of 23, 51, and 63 parameters were selected for HYPE, WATFLOOD, and HEC-HMS, respectively. Sample quantities were determined, in part, by sensitivity analyses (Holmes, 2016; Sagan, 2017; Stadnyk et al., Accepted) and by computational resources. Some parameters were grouped by land cover or soil characteristics to further reduce computational demand. A total of 6900, 15,300, and 18,900 parameter samples were run for HYPE, WATFLOOD, and HEC-HMS, respectively (i.e., 2300 samples for each of three precipitation inputs with HYPE). More samples were given to models with more parameters because the response surface becomes more complex as the number of parameters increases. These samples were held constant for each precipitation input. A simple top 10% criterion applied to Kling-Gupta Efficiency (KGE) scores was used to select behavioral parameter sets (Gupta et al., 2009); the criterion, however, is subjective, which is a commonly debated topic in the literature (e.g., Stedinger et al., 2008; Li et al., 2010; Li and Xu, 2014; Shafii et al., 2015). An advantage of the top 10% criterion is the generation of wider uncertainty bounds than selection by optimization of an ensemble metric (Shafii et al., 2015). The KGE metric assumes observed data are perfect (i.e., no variance from the fitted rating curve at the time of observation). Ideally, hydrometric uncertainty would be ingested into the modeling chain to influence behavioral simulation selection.
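As a concrete sketch of this selection step, the KGE score (Gupta et al., 2009) and the top 10% criterion might be computed as follows (function names are ours, not from the study):

```python
import numpy as np

def kge(sim, obs):
    # Kling-Gupta Efficiency (Gupta et al., 2009):
    # KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2),
    # where r is the linear correlation between simulated and observed flow,
    # alpha the ratio of standard deviations, and beta the ratio of means.
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = np.std(sim) / np.std(obs)
    beta = np.mean(sim) / np.mean(obs)
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)

def behavioral_indices(scores, fraction=0.1):
    # Indices of the top `fraction` of parameter sets, ranked by score.
    n = max(1, int(fraction * len(scores)))
    return np.argsort(scores)[::-1][:n]
```

A perfect simulation scores KGE = 1; selecting the top decile rather than a fixed score threshold keeps the behavioral set size constant across models and inputs.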
Two additional methods are also presented that better account for the ingestion of hydrometric uncertainty. Equations 2 and 3 present a performance metric (F_obj) developed by Westerberg et al. (2020): in which w(t) and Q_obs(t) are the weight and observed streamflow value for timestep t of T total timesteps, respectively. Weights are determined by Equation 3, where Q_sim(t) is the simulated streamflow for timestep t, and Q_L(t) and Q_U(t) are the lower and upper bounds of hydrometric uncertainty, respectively.
Integrating hydrometric uncertainty into an objective function does not reduce the subjectivity of behavioral simulation selection; the top 10% of simulations is still selected. A second selection criterion applied here is the combined overlap percentage (COP; Equation 4) presented by Westerberg et al. (2011): in which QR_o is the overlap with the observed hydrometric uncertainty range Q_Robs, and Q_Rsim is the simulated uncertainty bound range. Simulations are selected that maximize the COP metric, which reduces the likelihood range. Formal likelihood functions were not considered, to avoid simplifying epistemic errors, which would narrow the parameter uncertainty range.
The estimation method for observed hydrometric data uncertainty was adopted from McMillan et al. (2012) and simplified for this study. Open water flows and ice-affected flows were assumed to be distinguished by the "B" flag (Environment and Climate Change Canada, 2019). Open water hydrometric flow data were assigned ±10% uncertainty bounds, whereas winter ice-affected flows were assigned ±20% uncertainty bounds. This estimate of observed streamflow uncertainty is an oversimplification (Hamilton, 2008; McMillan et al., 2010; Hamilton and Moore, 2012; McMillan et al., 2012; Kiang et al., 2018), but is limited by hydrometric data availability. Metadata, defined as additional relevant information such as rating curves or error estimates, are not available for Canadian hydrometric data at this time. Total uncertainty was assessed using reliability and sharpness metrics (e.g., Montanari, 2005; Yadav et al., 2007; Gneiting et al., 2007; Shafii et al., 2015; Zhou et al., 2016). Reliability was defined as the percentage of overlap of the observed flow bounds (daily) and the simulated ensemble bounds (daily) for the full period. Sharpness was defined in this study as (Equation 5):

Sharpness = (1/T) Σ_t [(Q_obsUpper(t) - Q_obsLower(t)) / (Q_simUpper(t) - Q_simLower(t))]

in which Q_simUpper and Q_simLower are the upper and lower bounds of the simulated ensemble, respectively, and Q_obsUpper and Q_obsLower are the upper and lower observed flow uncertainty bounds. Reliability and sharpness values are calculated using the minimum and maximum ensemble bounds at each timestep, although some studies used a percentile range. A sharpness value of one means simulated uncertainty has, on average, the same uncertainty range as observed uncertainty. A sharpness value less than one indicates that simulated uncertainty exceeded observed uncertainty.
Theoretically, the only way for both reliability and sharpness, as defined here, to approach one over the long simulation period of 30 years would be for a hydrologic model ensemble to approach perfection in its representation of the physical environment, which is not feasible. In practice, short-term periods of adverse measurement conditions elevate hydrometric uncertainty (e.g., ice-on conditions, river ice breakup, overbank flooding, beaver dams, vegetation growth, channel morphology, etc.). This increased hydrometric uncertainty will not necessarily be reproduced by the hydrologic model, meaning a sharpness value above one is not always bad. With the simple estimate of hydrometric uncertainty implemented in this study, hydrometric uncertainty is likely under- or over-represented at each timestep, but becomes a reasonable average representation of long-term (1981-2010) hydrometric uncertainty. Still, short-term events like rating curve extrapolation (Kiang et al., 2018) or ice jams (Hamilton and Moore, 2012) generate periods of very high hydrometric uncertainty. The simple estimate of hydrometric uncertainty applied in this study will not capture such short-term events; we, therefore, present sharpness values for the full 1981-2010 study period to reduce the impact of such events in our analysis.
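A minimal sketch of the hydrometric bounds and the two metrics as described above; interpreting reliability as a per-day intersection of the simulated and observed bounds is our assumption, and names are illustrative:

```python
import numpy as np

def hydrometric_bounds(q_obs, ice_affected):
    # +/-10% bounds for open water flows, +/-20% for winter
    # ice-affected ("B" flag) flows.
    frac = np.where(ice_affected, 0.20, 0.10)
    return q_obs * (1.0 - frac), q_obs * (1.0 + frac)

def reliability(sim_lo, sim_hi, obs_lo, obs_hi):
    # Fraction of timesteps where the simulated ensemble bounds
    # intersect the observed uncertainty bounds.
    return np.mean((sim_lo <= obs_hi) & (sim_hi >= obs_lo))

def sharpness(sim_lo, sim_hi, obs_lo, obs_hi):
    # Mean ratio of observed to simulated bound width: one means the
    # simulated range matches the observed range on average; below one
    # means the simulated range is wider than the observed range.
    return np.mean((obs_hi - obs_lo) / (sim_hi - sim_lo))
```

The two metrics pull in opposite directions: widening the simulated bounds raises reliability but lowers sharpness, which is why they are reported together.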
If a simulated ensemble generates high-reliability values but very low sharpness values, the uncertainty range extends into the low uncertainty tails of the likelihood uncertainty distribution (Beven and Binley, 2014); simulations at the uncertainty distribution tails are often excluded as they are of sufficiently low likelihood to not be valuable for design or general operations. Comparisons of the cumulative effect of uncertainty were assessed with reliability and sharpness plots partitioned by binned 10-percentile ranges; for example, flows between the 10th and 20th percentiles were grouped and represented at the 15th percentile. Total uncertainty was presented for an average of the five highest flow volume years and for an average of the five lowest flow volume years.
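The percentile binning described above can be sketched as follows. This is a hypothetical helper, assuming bin edges come from the observed flow distribution and each bin is reported at its midpoint percentile (e.g., the 10th–20th bin at the 15th); the function name and the per-timestep overlap indicator are illustrative assumptions.

```python
import numpy as np

def reliability_by_percentile_bin(q_obs, overlap, n_bins=10):
    """Group a daily boolean overlap indicator (observed vs. simulated bounds
    intersect) by observed-flow percentile bins and return bin midpoints and
    the per-bin reliability (fraction of overlapping days)."""
    # Bin edges at the 0th, 10th, ..., 100th percentiles of observed flow
    edges = np.percentile(q_obs, np.linspace(0, 100, n_bins + 1))
    # Assign each flow to a bin using the interior edges only
    bins = np.clip(np.digitize(q_obs, edges[1:-1]), 0, n_bins - 1)
    # Report each bin at its midpoint percentile: 5, 15, ..., 95
    centers = np.linspace(100 / (2 * n_bins), 100 - 100 / (2 * n_bins), n_bins)
    rel = np.array([overlap[bins == b].mean() if np.any(bins == b) else np.nan
                    for b in range(n_bins)])
    return centers, rel
```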

Performance of behavioral simulations
The performance varied for each model (Table 4); HEC-HMS has the largest range in KGE values for five locations, WATFLOOD for four locations, and HYPE for two locations. This result is anticipated and reflects the number of parameters considered for each model. Regulated gauges (shown in bold) generally have the highest KGE scores and a narrow range of KGE values, since flows from upstream contributing areas are added as model input at nudging locations; only additional flow generated downstream of those locations (typically proportionately smaller) could affect regulated gauge performance. Simulations selected with the F_obj metric generally produced negative scores close to zero, meaning simulated output in a given timestep was often outside the range of hydrometric uncertainty, but reasonably represented observed streamflow data. COP scores for each model were generally above 0.4.

Structural and parameter uncertainty
Reliability and sharpness (Figure 2) for each model are representative of parameter uncertainty propagated to model output. Structural uncertainty propagated to simulated output is represented by selecting the highest performing simulation for each model. The structural and parameter-based ensemble is representative of the combination of propagated parameter and structural uncertainty. Results suggested three general relationships among uncertainty sources: reliability dominated by a single model; reliability dominated by different models for different flow percentiles; and complementary performance, in which no single model had high reliability, but ensemble reliabilities were still high. An example of each relationship is presented in Figure 2, as well as one regulated gauge.
The Taylor River reliability plot is generally dominated by a single model; the WATFLOOD model has reliability similar to that of the ensemble up to the 70th percentile flows. The Taylor River sharpness plot reflects that WATFLOOD's high reliability is due to wide-ranging model performance, supported by Table 2, in which WATFLOOD exhibits the widest range of KGE values. HEC-HMS has the lowest performance for the Taylor River, and a similar but slightly smaller sharpness range. The low performance of HEC-HMS stems from a consistent wet bias; no HEC-HMS simulations have a negative bias, and the lowest bias is 4.2%, in contrast to WATFLOOD, which has simulations with both positive and negative biases. The Taylor River sharpness plot for HYPE shows that the HYPE model alone is not able to represent the full uncertainty range of the observed data, although this is generally the sharpest gauge for HEC-HMS and WATFLOOD as well, since the Taylor River is the smallest basin and is dominated by forest, which simplified the parameter uncertainty for that gauge. The Weir River reliability plot (Figure 2) indicates that multiple models are of value to the ensemble reliability; WATFLOOD is better at representing observed uncertainty for flows below the 60th percentile, while HEC-HMS is better at representing flows above the 60th percentile. As reliability increases, sharpness generally decreases.
The Burntwood above Leaf Rapids location shows no model to be of high reliability between the 40th and 70th percentiles; however, the ensemble reliability is still close to 100%. Different model structures are complementary to the ensemble reliability; however, the decrease in sharpness of the ensemble versus any individual model is also generally high. For flows below the 40th percentile, HYPE and WATFLOOD are of high reliability but are not the models with the lowest sharpness. Unlike the high reliability of WATFLOOD in the Weir River, which is sourced from wide uncertainty bounds, WATFLOOD's high reliability in the Burntwood above Leaf Rapids region is sourced from narrow uncertainty bounds, relative to HEC-HMS and HYPE.
For the regulated Nelson at Long Spruce gauge, the models are complementary to the ensemble's reliability. Sharpness values for the Nelson at Long Spruce show that the models generally produced uncertainty bounds narrower than the observed uncertainty bounds. At this location, however, observed flows are added for contributing areas upstream of the LNRB at nudging points, and uncertainty from these upstream locations is not propagated downstream or reflected in the model results.
Propagated structural uncertainty always has higher reliability than the parameter uncertainty of at least one individual model. For the Burntwood above Leaf Rapids and Nelson at Long Spruce Generating Station gauges (Figure 2), structural uncertainty is more valuable to the combined ensemble than any model's parameter uncertainty for three and seven of the selected percentile bins, respectively. The average reliability across all stations and percentiles for structural uncertainty is 48%, as is the average reliability for parameter uncertainty, averaged across all three models. The average sharpness across all gauges and percentiles for structural and parameter uncertainty is 1.8 and 2.4, respectively.
Simulations selected by the F_obj metric generated slightly lower reliabilities for low percentiles, but slightly higher reliabilities for high percentiles, when compared to the KGE-selected simulations (Appendix A; Figure A1). Since the F_obj metric gives higher weight to high flows, improved high-flow reliability was expected. F_obj-selected simulations generally produced sharper uncertainty bounds than simulations selected using the KGE. Simulations selected by optimized COP were generally sharper than both the KGE and F_obj simulations without notable loss of reliability (Appendix A; Figure A2). All ensembles selected using the COP consisted of fewer simulations than the number selected by the top 10% criterion (e.g., <230 simulations for HYPE). Simulations were not selected based on individual performance with the COP, so no best performing simulation for each model could be selected to assess structural uncertainty. Since the goal was to ensure low likelihood uncertainty distribution tails were not rejected, the absence of a structural uncertainty estimate does not limit the interpretation of the COP-selected simulations.

Input uncertainty
The uncertainty generated from varying model structures and parameters, propagated to streamflow, varies in magnitude (Figure 2 and Table 2). When the wide range of input uncertainty is included, reliabilities generally increase to near 100% for unregulated gauges across all three models for all simulation selection criteria. With input data uncertainty included, the percentile ranges of sharpness are, on average, 8.8, 6.3, 3.0, and 4.1 times lower for HYPE, WATFLOOD, HEC-HMS, and the structural and parameter ensemble, respectively, for the KGE-selected simulations; sharpness was reduced for all simulation selection methods. These changes are relative and do not mean HYPE has the widest uncertainty bounds when input uncertainty is included. With reliability values at or near 100% and low sharpness, the uncertainty range extends to the tails of the likelihood distribution. Reliabilities for each of the three regulated gauges are generally >60% for all flow percentiles and simulation selection methods. Regulated gauge reliabilities are not as affected by input uncertainty as those of non-regulated gauges, because the regulated gauges are affected by flow nudging. Results for the Weir River with simulations selected by KGE are presented; the Weir River was selected since it displayed some of the three general uncertainty relationships highlighted by the parameter and structural reliability and sharpness metrics (Figure 3). The width of the flow envelope in each panel is a representation of uncertainty propagated through a hydrologic model. HEC-HMS has the widest parameter uncertainty range in more basins than WATFLOOD or HYPE (e.g., Figure 3j versus Figure 3f). Input uncertainty is widest for WATFLOOD and narrowest for HEC-HMS, which is generally consistent across all gauges and simulation selections, although there was some variability. The range of structural uncertainty is generally narrower for the minimum precipitation input and wider for the maximum precipitation input.
Areas with zero density (i.e., gaps between hydrographs) are the result of only sampling at the extreme ranges of input uncertainty (i.e., minimum, mean, and maximum); if further sampling is done, the uncertainty range (i.e. reliability) would not change, but the areas of zero density would fill in.

Total modeling uncertainty
On average, most of the hydrometric uncertainty bounds are enveloped by the 10th and 90th percentiles of the combined structural and parametric uncertainty (Figure 4a and b). Hydrometric flow bounds that lie outside the 10th and 90th percentiles are generally below the simulated ensemble for low flow volume years (Figure 4a) and above the simulated ensemble for high flow volume years (Figure 4b); this is consistent for all simulation selection methods. Including input uncertainty notably widens the simulated uncertainty bounds for both low and high flow volume years (Figure 4c and d). Most of the low flow observed bounds are bracketed by the 25th and 75th percentiles of the total simulated ensemble; however, the 10th and 90th percentiles are required to envelop the spring runoff peaks in the high flow volume years. The total simulated ensemble generally has the highest simulated density near the observed flow bounds, while much lower density is present near the uncertainty bounds. The COP selection criterion generates a much sharper total uncertainty estimate (Figure 5). Simulations that introduce timing issues were rejected by the COP criterion. COP values for total uncertainty were generally above 0.6, but many of the low likelihood uncertainty distribution tails were rejected. Similar to the KGE- and F_obj-selected simulations, high (low) flow years were often under- (over-) estimated.
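The percentile-envelope comparison above can be sketched as follows. This is a minimal illustration, assuming an ensemble array of shape (members × days) and counting a day as enveloped only when the chosen percentile band fully brackets the observed bounds; the function name and bracketing rule are our assumptions, not the authors' code.

```python
import numpy as np

def envelope_fraction(q_ens, obs_lo, obs_hi, lo_pct=10, hi_pct=90):
    """Fraction of days on which the [lo_pct, hi_pct] percentile band of the
    simulated ensemble (members x days) fully brackets the observed
    hydrometric uncertainty bounds."""
    sim_lo = np.percentile(q_ens, lo_pct, axis=0)
    sim_hi = np.percentile(q_ens, hi_pct, axis=0)
    return np.mean((sim_lo <= obs_lo) & (sim_hi >= obs_hi))
```

Comparing the value returned for the 10th–90th band against the 25th–75th band gives a simple numerical analogue of the visual comparison in Figure 4c and d.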

Cumulative effects of uncertainty propagation
Models selected from the top 10%, rather than by optimization, generated uncertainty bounds that are wider than when optimized by the COP metric (e.g., Stedinger et al., 2008; Shafii and Tolson, 2015). Narrow uncertainty bounds are considered desirable (Shafii and Tolson, 2015); however, without consideration of hydrometric uncertainty, bounds may be too narrow to adequately represent hydrometric data uncertainty. The combination of parameter and structural uncertainty (Figure 2) is an improvement over parameter uncertainty alone for reliability for all simulation selection criteria (Ajami et al., 2007; Chen et al., 2011; Muhammad et al., 2018b; Muhammad et al., 2019). An important consideration, however, is the quality of the hydrometric uncertainty estimate. The simple estimate for hydrometric uncertainty is oversimplified, but is reasonable when considered over the 30-year period, and preferable to not considering hydrometric uncertainty at all. Each hydrologic model structure has temporally variant strengths that, when considered as an ensemble (e.g., Figure 2), improve reliability (Ajami et al., 2007). The propagated structural uncertainty generally has reliability higher than at least one parameter uncertainty estimate and is generally one of the sharpest uncertainty partitions, which suggests that structural uncertainty would be identified as more important than parameter uncertainty if hydrometric uncertainty were not considered (e.g., Ajami et al., 2007; Chen et al., 2011; Dams et al., 2015). Precipitation ensemble inputs present wide uncertainty bounds; when propagated through the hydrologic models, the resulting simulated uncertainty bounds approach 100% reliability for all simulation selection criteria.
It is common to consider a wide percentile range such as the 95% prediction range (Shafii and Tolson, 2015); however, narrower percentile ranges are also presented in the literature, such as the 35th and 65th percentile bounds presented by Han and Coulibaly (2019). The choice of which percentiles to use is based on communication goals (Pappenberger et al., 2013). Wide uncertainty bounds include low likelihood events that are of importance (particularly when considering long-term climate change), but if presented improperly, may lower confidence in the simulated output (Demeritt et al., 2007). The 10th, 25th, 75th, and 90th percentiles are presented in Figure 4c and d. The 10th and 90th percentiles represent a wide range of uncertainty for most of the year but are needed to bracket some of the highest and lowest flow events. The simulated range of uncertainty was found to be similar to that of some flood forecasting uncertainties (Han and Coulibaly, 2019), or those considering alternative meteorological data sources, like satellite-derived precipitation inputs (e.g., Nikolopoulos et al., 2010). Therefore, while the addition of input uncertainty generates a wide range of uncertainty, the improvement in reliability suggests that all sources of uncertainty should be considered to better represent hydrometric streamflow uncertainty. In data-sparse regions, there is often not enough information to constrain uncertainty; low-density observational networks adversely impact calibration for all environmental models (e.g., Hutchinson et al., 2009). Constraining uncertainty to produce sharp uncertainty bounds is almost always the goal of studies considering uncertainty (e.g., Ajami et al., 2007); however, for flood forecasting, wide uncertainty bounds are generated and uncertainty is constrained by considering narrower percentile ranges in post-processing steps (Han and Coulibaly, 2019).
In forecasting studies, forecasted meteorologic conditions present a large source of (input) uncertainty that remains challenging to constrain. Narrow percentiles, like the 35th and 65th, may be considered if there is little consequence for misrepresentation of a flow event (usually at specified quantiles of interest). If there are notable consequences for misrepresenting a given quantile event that occurs at low likelihoods, wider percentiles are used instead. If low likelihood simulations are rejected, then wider percentiles cannot be considered (e.g., Figure 5).
Connecting this to the BaySys project, in addition to a remote, data-sparse study region, the goal was to produce a wide range of uncertainty that could be considered across various percentile ranges to explore the potential impact of freshwater extremes on the marine system circulation (Barber, 2014). As more high-quality data are made available through remote sensing techniques, more of the lower likelihood uncertainty distribution tails could be rejected from the selection criteria. In data-sparse regions, however, consideration of low likelihood simulations adds value to interpretations of results. This result would also apply to other data-sparse regions; however, for areas with higher data density, lower likelihood events could perhaps be better defined. While informal likelihood functions are used in this study, the use of formal likelihood functions would not reject simulations, but rather assign them lower likelihoods (Beven and Binley, 2014). Therefore, the distribution tails are present in most uncertainty studies, but they are simply rejected in favor of narrower uncertainty bounds (Ajami et al., 2007).

[Figure 5: Uncertainty bounds compared with estimated hydrometric data uncertainty bounds for the Weir River (average daily). Results are presented for the five highest flow volume years (High Flows) for total uncertainty generated from COP-selected simulations. Cumulative simulation density is shaded for each day, with black representing zero density and white representing the area above the highest flow simulation. x-axis labels are simplified to show only the first day of each month. DOI: https://doi.org/10.1525/elementa.431.f5]

Interaction of uncertainties
The interaction of uncertainty sources was impacted by the performance metric used to select simulations. Differences in performance metrics have been a relevant topic in recent literature, particularly for use in multiobjective calibration (e.g., Asadzadeh and Tolson, 2013).
Other performance metrics that do not ingest hydrometric uncertainty, such as the Nash-Sutcliffe Efficiency, were considered as behavioral simulation criteria, but produced little variation in reliability and sharpness values. Ingesting hydrometric uncertainty with the F_obj metric generally produced sharper ensembles but focused on increasing reliability for high flows (Westerberg et al., 2020). The COP-selected simulation ensembles were generally the sharpest and included fewer simulations than the top 10% (Appendix A). Since OLHS is used, no single performance metric is the focus of the optimization. Many calibration frameworks exist that can likely offer efficient sampling (e.g., Efstratiadis and Koutsoyiannis, 2010; Shafii and Tolson, 2015); however, studies such as Ajami et al. (2007) show that input uncertainty affects parameter distributions. Separate calibrations for each input dataset would explore different areas of the response surface and narrow the range of parameter uncertainty, thereby limiting the exploration of low likelihood events. Studies such as Vaze et al. (2010) suggest models are less able to simulate conditions that are wetter or drier than those to which a model is calibrated. The results presented in Figure 3 agree with the results of Vaze et al. (2010), as they show low performance for very dry or wet inputs, but highlight the variation of structural and parametric uncertainty when biases are introduced through the input data. Parameter uncertainty decreases with the ensemble minimum precipitation input and increases with the ensemble maximum. An interpretation of this is that when precipitation volume is sufficiently low, most water is lost to infiltration or evapotranspiration (ET), leaving soil reservoir capacity or potential ET (PET) capacity available; changing the capacities of PET and soil reservoirs would then affect fewer events, which would lower sensitivity.
Similarly, when precipitation volume is sufficiently high, PET and soil reservoir capacities are often exceeded, which suggests smaller parameter changes could be more impactful to a larger number of precipitation events (Vaze et al., 2010; Chen et al., 2011; Merz et al., 2011; Coron et al., 2012; Brigode et al., 2013).
The propagation of structural uncertainty also displays distinct interaction with input uncertainty. Spatial aggregation, such as that of HEC-HMS, generally reduced the sensitivity to the ingested input uncertainty (Fischer et al., 2013;Pendergrass et al., 2017); this further suggests that model structure affects the sensitivity of climate change projections. In a data-sparse region, including low likelihood uncertainty distribution tails may also improve the ability to explore changes in less frequent events, such as floods or droughts (Mendoza et al., 2016). Under climate change, these distribution tails may be more sensitive to climate shifts (Vaze et al., 2010;Mendoza et al., 2016). Therefore, consideration of cumulative sources of uncertainty that extend into low likelihood uncertainty distribution tails can be considered a viable method for examining different percentile ranges of propagated uncertainty towards understanding low likelihood events in a data-sparse region.

Conclusions
Propagated uncertainty from input, parameter, and structural uncertainty sources for an ensemble of hydrologic models was generated and evaluated against a simple estimate of hydrometric uncertainty at several hydrometric flow gauges. Results are highlighted as follows:
• Structural uncertainty without the inclusion of hydrometric uncertainty will likely appear as the highest quality flow ensemble;
• When compared with hydrometric uncertainty, neither parameter nor structural uncertainty alone is sufficient to represent the range of hydrometric uncertainty;
• Inclusion of all sources of uncertainty generates the widest uncertainty bounds, but the highest reliability;
• The relative magnitude of structural and parameter uncertainty is a function of input uncertainty, suggesting variations in the contribution to total uncertainty by source, with increased variability expected for climate change studies and low likelihood events; and
• The generation of wide uncertainty bounds that include low likelihood simulations can be used to explore low likelihood streamflow events in a data-sparse environment.
Consideration of all sources of uncertainty is expected to improve the quality of the simulated ensemble, although the generated uncertainty bounds widen into low likelihood uncertainty distribution tails if simulations are not selected with ensemble optimization techniques; this widening offers benefits for data-sparse environments. Each source of uncertainty adds further value, but, without the consideration of hydrometric uncertainty, the added value may not be obvious. Basin scale affects the sources of uncertainty and how they propagate; additional research into the cumulative effects of uncertainty propagation at different basin scales should be considered. Additionally, further research into the evolution of structural and parameter uncertainty when forced by GCM climate projections should be considered, since the contribution of each source of uncertainty displayed interaction with the input data, and low likelihood uncertainty distribution tails are likely to be sensitive to climatic shifts.

Data Accessibility Statement
Meteorological data used in this study are publicly available datasets. All hydrologic outputs are publicly available. Figures following the format of Figure 3 and Figure 4 are also available in the OSF data repository: 10.17605/OSF.IO/ZQ3J6.