A spatially explicit inventory scaling approach to estimate urban CO 2 emissions

Appropriate techniques to quantify greenhouse gas emission reductions in cities over time are necessary to monitor the progress of these efforts and effectively inform continuing mitigation. We introduce a scaling factor (SF) method that combines aircraft measurements and dispersion modeling to estimate urban emissions and apply it to 9 nongrowing season research aircraft flights around New York City (NYC) in 2 0 18–2 0 2 0 .This SF approach uses a weighting function to focus on an area of interest while still accounting for upwind emissions. We estimate carbon dioxide (CO 2 ) emissions from NYC and the Greater New York Area (GNA) and compare to nested inversion analyses of the same data. The average calculated CO 2 emission rates for NYC and the GNA, representative of daytime emissions for the flights, were (49 ± 16) kmol/s and (144 ± 44) kmol/s, respectively (uncertainties reported as ± 1 s variability across the 9 flights). These emissions are within * 15% of an inversion analysis and agree well with inventory estimates. By using an ensemble, we also investigate the variability introduced by several sources and find that day-to-day variability dominates the overall variability. This work investigates and demonstrates the capability of an SF method to quantify emissions specific to particular areas of interest.


Introduction
Carbon dioxide (CO 2 ) emissions from fossil fuel combustion are the largest source of anthropogenic climate change and urban areas are responsible for a large fraction of these emissions (Intergovernmental Panel on Climate Change [IPCC], 2013).According to 2015 inventory data (Gurney et al., 2020a) and census urban area definitions (U.S. Census Bureau, 2021c), 43.8% of CO 2 emissions across the contiguous United States came from urban areas and the urban share of emissions is expected to increase as populations continue urbanizing (International Energy Agency, 2013; IPCC, 2013; United Nations decrease across all sectors by 2050 (Kaminsky, 2019).In addition to this, New York City (NYC) has passed the Climate Mobilization Act that includes several bills targeting building emissions, the largest emitting sector for the city when including indirect emissions (Cventure LLC et al., 2017), with large fines for noncompliance (New York City Council, 2019).
Several recent studies focused on the assessment of large northeastern U.S. city emissions have been published (McKain et al., 2015;Ren et al., 2018;Sargent et al., 2018;Plant et al., 2019;Ahn et al., 2020;Lopez-Coto et al., 2020;Floerchinger et al., 2021;Pitt et al., 2022).Table 1 highlights some key details about 3 of these northeastern U.S. urban areas for context (U.S. Census Bureau, 2021b).It is clear from Table 1 that the Greater New York Area (GNA), or New York-Newark urban area as named in the Census database (U.S. Census Bureau, 2021c), is significantly larger (more urban sprawl) and denser than the previously studied urban areas.Additionally, almost half of the CO 2 emissions come from NYC, the dense core of the GNA.All urban areas in Table 1, including the GNA, are those defined by the U.S. census.Their definition of urbanized areas can be summarized as a central urban core and any densely settled surrounding territory (urban fringe) that together has a minimum of 50,000 people, typically including any contiguous urban fringe with at least 1,000 people per square mile (U.S. Census Bureau, 1995).
As cities enact laws with explicit reduction requirements, it becomes necessary to have high-precision emissions monitoring techniques to inform progress.This is often done using self-reported bottom-up inventories, which involve combining emission factors (e.g., kmol CO 2 /m 2 s for residential housing) with activity data (e.g., total area of residential homes, m 2 ) to calculate emissions across source categories.In addition to the self-reported inventories that cities use, there are published bottom-up inventories that utilize a combination of emission factors and activity data, modeling, monitoring station data, fuel statistics, and so on from agencies like the Energy Information Agency and Environmental Protection Agency to allocate emissions in a gridded field (Gurney et al., 2009;Gurney et al., 2012;Gately & Hutyra, 2017).Gurney et al. (2021) discuss in detail the large differences that can exist between these self-reported inventories and published inventories.There are also published disaggregated inventories that rely instead on proxies to disaggregate reported national total emissions to a gridded product.These can be generated more regularly at a global scale but have additional uncertainty due to the use of a proxy rather than the relevant activity data (Janssens-Maenhout et al., 2017;Oda et al., 2018;Gurney et al., 2019).These published inventories can inform policy about emissions and are often used as priors in atmospheric transport modeling efforts (Lamb et al., 2016;Gurney et al., 2019).
Bottom-up inventories rely on a vast amount of input information, which makes them difficult and timeconsuming to develop.This leads to gaps between emissions represented in inventories and present-day emissions.Top-down studies fill this gap and provide an independent estimate to compare to inventories by using measurements to quantify emissions in near-real time (Cambaliza et al., 2014;Lauvaux et al., 2016;Gourdji et al., 2018;Pitt et al., 2018;Turnbull et al., 2019;Gurney et al., 2021).For example, Turner et al. (2020) and Yadav et al. (2021) quantified CO 2 reductions during the COVID-19 pandemic and identified the economic sectors most affected.Such a task would be impossible without both high-quality inventories and recent high-precision measurements.Timely information like this can be crucial to inform policy makers about the impacts of legislation, so they can adjust policy as needed to meet emission reduction targets.
Three commonly used top-down methods for emission quantification are the mass balance experiment, inverse modeling, and inventory scaling.Mass balance experiments provide a domain-total emission rate that can be calculated relatively quickly based on a simple conceptual model.However, it can be difficult to define an appropriate background concentration and the area which the calculated emission rate represents (Cambaliza et al., 2014;O'Shea et al., 2014;Heimburger et al., 2017;Ren et al., 2018;Ahn et al., 2020).Inverse modeling techniques use a dispersion model to simulate concentration enhancements at the measurement locations based on an initial emission map (prior), typically an emission inventory.They then typically weight these data based on prescribed error covariances of the transport model, prior, and measurements to reach an optimized emission map (posterior) that reduces the mismatch between the modeled and measured enhancements (Michalak et al., 2004;Mueller et al., 2008;Stohl et al., 2009;Lauvaux et al., 2016;Lauvaux et al., 2020;Lopez-Coto et al., 2020).Scaling approaches follow the same basic premise as an inverse model except they optimize the entire emission Bottom-up emissions are from Vulcan (Gurney et al., 2020a), discussed further in the Methods and Materials section.
map using a single scaling factor (SF).This assumes that the spatial distribution of emissions in the prior is accurate; however, it allows for a simpler approach to calculate posterior emissions and their sensitivity to model parameters while avoiding the complexities associated with defining error covariances (McKain et al., 2015;Pitt et al., 2018;Sargent et al., 2018;Feng et al., 2019;Kunik et al., 2019;Pitt et al., 2022).Karion et al. (2019) and Neininger et al. (2021) discuss work using several of these approaches and compare the results across them.
In this work, we introduce a variation of the SF approach.This approach requires a prior emission map of surface fluxes, a dispersion model to relate surface emissions to downwind measurements, and measured concentrations against which these modeled enhancements are compared.No other measured quantities are used.As legislation is limited to political boundaries, the focus of an SF approach should be on such discrete areas of interest (AOIs).However, measurements are influenced by all sources upwind and not solely those within an AOI, so modeled enhancements must incorporate all upwind emissions (i.e., use a large modeling domain) to appropriately compare to measured enhancements.As such, defining an appropriate background concentration to separate the AOI from other upwind sources is crucial (Mueller et al., 2018;Pitt et al., 2018;Sargent et al., 2018).Here, we begin with a simple long-range background based on the lowest observed and modeled concentrations.This allows us to relate measured and modeled enhancements as it accounts for the regional background concentration that only affects measurements, but the resulting enhancements represent all upwind emissions within the modeled domain (Lopez-Coto et al., 2020;Pitt et al., 2022).Using these enhancements in an SF approach requires assuming that the SF for the entire upwind region is the same as that of just the AOI.This could be particularly problematic with small AOIs such as cities where only a fraction of the total emissions may originate from the AOI.For example, one can imagine a prior with city emissions lower than true emissions (SF ¼ 2) while those in the surrounding semirural area are slightly high (SF ¼ 0.8).If this spatial heterogeneity across the domain was ignored, the observational analysis for all upwind emissions would result in a smaller SF than that of city emissions.If this smaller SF is then applied to just the city, the AOI, the posterior city emissions would be biased low.On the other hand, if the upwind sources were unaccounted-for in the simulation domain, the measured enhancements that originated in those sources would be attributed to the AOI, causing the city emissions to be biased high instead.Worse still, flights are influenced by upwind sources to varying degrees, meaning that this bias would vary from flight to flight, increasing daily variability.To minimize these effects, we have developed an SF approach that is specific to an AOI by weighting modeled and measured enhancements by the fraction of the modeled enhancements originated by the emissions only in the AOI, in addition to accounting for the long-range background.
Here, we apply this SF approach to 9 nongrowingseason flights downwind of NYC, the details of which are available in Supporting Information (SI) Table S1.We calculate posterior emission rates for 2 AOIs: the 5 NYC boroughs (Department of City Planning [DCP], n.d.) and the larger GNA (U.S. Census Bureau, 2021c), which includes NYC, and compare to an inversion analysis using the same AOIs (Pitt et al., 2022).These AOIs are clearly outlined in Figure 1A and B. Finally, we investigate multiple methods of calculating the SF and discuss the variability in the resulting posterior emission rates across SF calculation method, flight day, choice of meteorological model outputs (MET) used in dispersion modeling, and choice of prior.The different assumptions inherent in this method make it a good complement to more complex inversion approaches when the spatial distribution of prior emissions is sufficiently accurate to reproduce the measured plume structure.

Flight design and equipment
All flights were conducted using Purdue University's Airborne Laboratory for Atmospheric Research (ALAR), a modified twin-engine Beechcraft Duchess (Cambaliza et al., 2014;Heimburger et al., 2017).ALAR is equipped with a GPS/INS (Global Positioning System/Inertial Navigation System) system, a Best Air Turbulence probe for high precision 3-dimensional winds (Garman et al., 2006;Garman et al., 2008), and a Picarro Cavity Ring Down Spectrometer designed for measurements of CO 2 , methane (CH 4 ), and water vapor (H 2 O; Crosson, 2008), although this work focuses exclusively on CO 2 emissions.Measurements are reported as dry air mole fractions, in units of micromoles per mole of dry air, or parts per million (ppm).Each flight included multiple in situ 3-point calibrations using National Oceanic and Atmospheric Administration-certified standard cylinders of CO 2 and CH 4 (WMO-CO 2 -X2007, WMO-CH 4 -X2004A; Dlugokencky et al., 2005;Tans et al., 2017; see SI section S1 for detailed calibration information).
Over the course of the campaign, we used 2 different G2301-f analyzers and a G2401-m analyzer.The G2301-f analyzers were both run in low-flow mode, which measures each species every approximately 1.2 s and approximately 1.4 s, respectively.The G2401-m analyzer measured each species approximately every 2.3 s.We estimate the precision of the CO 2 measurements to be 0.04 ppm based on the average 1s variability during calibrations.
This campaign involved 9 flights around NYC in February/March of 2019 and 2020 and November of 2018 and 2019 when the biosphere was less active.Details about the conditions of each flight are available in SI Table S1.The flights are discussed throughout the text using sequential research flight (RF) labels (e.g., RF4). Figure 1A and B shows the corresponding flight dates.Airborne measurements are sensitive to emissions distributed over a wide area relative to ground-based observations and can capture any vertical variability in CO 2 concentrations.All flights were conducted by flying multiple transects downwind of the AOIs at different heights within the boundary layer.Within a given flight, the flight track across all transects was kept as constant as possible.Most flights also This work employs an ensemble approach to estimate the variability in the result introduced by the various inputs (transport models and emission inventories) as well as SF calculation methods and avoids relying entirely on any one particular member.We use a total of 4 meteorology data products (MET), 2 turbulence parameterizations in the dispersion model (making 8 different transport models), 4 bottom-up inventories as priors, and 5 SF calculation methods, all discussed in detail below and summarized in Table 2.This results in a total of 160 ensemble members per flight.The distributions of posterior emission rates across these 160 ensemble members times 9 RFs (1,440 individual posterior emission rates) are discussed in the results.Relying on the mean of a plausible group of, for example, MET, mitigates potential errors in any one particular MET while also allowing us to estimate the contribution of MET uncertainty (quantified as the spread in the posterior emission rate across the 8 transport models) to the overall posterior emission rate uncertainty.

Transport modeling
The surface influence, or footprint, of flight data was determined using the Hybrid Single Particle Lagrangian Integrated Trajectory Model (HYSPLIT) v 5.0.0 (Draxler & Hess, 1998;Stein et al., 2016;Loughner et al., 2021).A footprint, or influence function, represents the measurement's sensitivity to surface emissions, providing a quantitative link between measured enhancements in concentration and upwind surface fluxes.Each HYSPLIT run was repeated using 4 different publicly available MET as an input and 2 different parameterizations of vertical velocity variance, Kantha-Clayson (Kantha & Clayson, 2000) and Hanna (Hanna, 1982), as described in Table 2 (see Table S2 for the spatiotemporal resolution of each MET).Footprints were generated on a large modeling domain, bounded by 34.4 N, 44.4 N, 83.7 W, and 69.7 W at a resolution of 0.08 Â 0.08, as shown in Figure 1B-E, to ensure that potentially significant upwind sources were included in the simulation.
Measured concentrations were block-averaged to a 1-min resolution for all flight data within the boundary layer.HYSPLIT particle releases for each minute consisted Prior acronyms include the year that the data represent.
of approximately 1,000 particles distributed across the locations flown during that minute.Modeled anthropogenic CO 2 enhancements along the flight track were then simulated by multiplying footprints (ppmÁð m 2 s mmol Þ or mole fraction/flux) from the HYSPLIT dispersion model (Draxler & Hess, 1998;Stein et al., 2016;Loughner et al., 2021) with emissions from one of the 4 priors (converted to units of mmol m 2 s ).Modeled biogenic CO 2 enhancements were then calculated in the same manner using the vegetation photosynthesis and respiration model (VPRM; Mahadevan et al., 2008;Gourdji et al., 2022) and added to modeled anthropogenic enhancements.This simulates the mixing ratio enhancement in CO 2 (ppm) relative to a long-range background value at the boundary of the domain, and this background must be subtracted from measured concentrations to compare the measured enhancements with these modeled enhancements.

Background calculations
We calculate the long-range background using a percentile method.We define times when we sampled background air using both measured and modeled data to mitigate the potential risk of selecting background times with large modeled enhancements in cases where the modeled plume is mislocated.First, we add the 1-min blockaveraged modeled and measured data pointwise.Then, we identify the timestamps with concentrations between the first and fifth percentile of concentrations from this combined data set.These timestamps represent background air sampling for the data sets.The measured background is then defined as the average of all measured data during these timestamps, and the modeled background is the average of all modeled data during these timestamps.As the timestamps chosen to represent background are partially dependent on the modeled data, this process is done separately for every flight and for every unique ensemble member to yield slightly different backgrounds.

Prior emissions
We use 2 published bottom-up national priors, Anthropogenic Carbon Emissions System (ACES) version 2.0 (Gately & Hutyra, 2017) and Vulcan version 3.0 (Gurney et al., 2020b) as well as 2 global disaggregated priors, Open-Source Data Inventory for Anthropogenic CO 2 (ODIAC) version 2019 (Oda & Maksyutov, 2015), and Emission Database for Global Atmospheric Research (EDGAR) version 5 (Crippa et al., 2019;European Commission, 2019) to calculate modeled anthropogenic CO 2 emissions.Both ACES and Vulcan are hourly, 1-km 2 national bottom-up CO 2 priors.As bottom-up products, they both incorporate data from emission factors and activity data, modeling, fuel statistics, and so on to allocate emissions in a gridded field.They generally make use of similar methods and data sets, but there are still several methodological differences that can lead to some significant gridcell differences, as discussed in detail by Gurney et al. (2020b).For the analyses using Vulcan and ACES, Canadian emissions within the domain are taken from EDGAR since ACES and Vulcan do not extend beyond the United States.EDGAR (Janssens-Maenhout et al., 2017;Crippa et al., 2019;European Commission, 2019) is a monthly, 0.1 (approximately 12 km 2 for the northeast United States) global prior that disaggregates national emissions using nationally specific activity data and emission factors combined with the appropriate proxies for different economic sectors (e.g., human population, road density).It also makes use of the Carbon Monitoring for Action database for point sources.Lastly, ODIAC (Oda & Maksyutov, 2015;Oda et al., 2018) is a monthly, 1 km 2 , global prior that disaggregates the Carbon Dioxide Information Analysis Center national emissions primarily by using the Carbon Monitoring for Action database for point sources and night-light satellite data for nonpoint sources.For all analyses, we use annually averaged emissions for the most recent year available for each prior, which is indicated in Table 2.We also regridded all priors to match the extent and resolution of the domain using a conservative regridding scheme provided by the Python package xESMF (Zhuang, 2021).
Measured concentrations represent bulk emissions from all upwind sources, including both anthropogenic emissions and biogenic emissions, but all priors used only represent anthropogenic emissions.To incorporate biogenic CO 2 emissions into the modeled emissions, we also used a modified version of VPRM (Mahadevan et al., 2008), described in detail by Gourdji et al. (2022).VPRM is an empirical model optimized using flux tower data.It calculates net ecosystem exchange (respiration þ photosynthesis) as a function of the enhanced vegetation index, land surface water index, temperature, and photosynthetically active radiation using a combination of in situ measurements and remote sensing.We used hourly fluxes for the dates of the flight data, generated at a 0.02 resolution.

Weighting approach
Figure 2 shows 3 examples of the measured and modeled enhancements, the modeled enhancements originating only from the AOIs, and the weighting functions.It is clear from Figure 2 (A, C, and E) that AOI and non-AOI enhancements show concurrent peaks in many cases.As previously mentioned, the long-range background subtracted from both measured and modeled values allows them to be directly compared but does not address sources upwind of the AOI overlapping with those of the AOI (e.g., Philadelphia in RF5).To address this problem, we have developed a weighting function to better isolate the enhancements of specific AOIs.
We calculate a weighting function for 2 different AOIs, the 5 NYC boroughs (including water areas; DCP) and the GNA (U.S. Census Bureau, 2021c), which includes NYC, as highlighted in Figure 1B and A, respectively.The weighting function is defined pointwise (Weighting i ) as the fraction of modeled enhancements, that is, the product of prior and footprint, from the AOI (Modeled AOI i ) divided by the modeled enhancements for the entire domain (Modeled total i ) as shown in Equation 1.As the weighting function is calculated using modeled enhancements, it relies somewhat more heavily on the modeling (MET and prior) than simply using a long-range background.This approach, like all scaling approaches, is also entirely reliant on the prior for the spatial distribution of emissions.Weights are then multiplied with total modeled enhancements, meaning the sum of AOI and non-AOI enhancements (i.e., the solid red line in Figure 2) and measured enhancements after subtracting their respective long-range backgrounds.This weighting allows us to minimize the effect of non-AOI sources before calculating an SF and can be thought of as an additional background subtraction that is specific to the AOI.
Flights with significant influence upwind of the AOI, such as the flight on November 10, 2019 (RF5), would be poorly suited to any SF approach without some way to isolate the AOI, as the measurements are heavily influenced by non-AOI emissions.In such cases, one likely cannot assume that the SF for the AOI is consistent with that for all upwind sources either, and therefore, the influence of these non-AOI emissions could limit the number of usable flights (to seven of the nine in our case) if it is unaccounted-for.However, with the weighting approach, these flights can still be used, as non-AOI emissions are deweighted appropriately to focus the SF primarily on the AOI.Weighting should also reduce the variability from flight to flight as the footprints are more comparable flight-to-flight without the influence of regions farther upwind.

SF approaches
We calculated SFs, which represent the factor by which the weighted modeled enhancements need to be multiplied to best match the weighted measured enhancements, using 5 methods.We calculated a single SF for each ensemble member for each flight (i.e., not per transect).
The ensemble approach allows us to balance the benefits and drawbacks of each technique.For example, regression-based approaches would be more heavily impacted by a poor correlation between measured and modeled enhancements, which could occur if plumes are offset due to prior or transport model error.Bias correction methods would, on the other hand, be most impacted by background errors.Posterior emission totals for the AOIs were then calculated by multiplying these SFs by the total prior emissions within the AOIs, which can be seen in Figure 1B-E (see Figure S2 for the priors shown over the AOIs only).

Integral (I) SF method
The first method was an integral ratio SF (I).We integrated weighted enhancements across each flight for both measured and modeled enhancements.The ratio of these integrated enhancements gives an SF for that flight (Equation 2) Here, CO 2obs and CO 2mod represent the measured and modeled mole fractions, respectively, while BG obs and BG mod represent the measured and modeled long-range background mole fractions, Weight represents the weighting function and the index i represents the index of the 1 min measurements along the flight.

Regression (ordinary least squares [OLS] and major axis [MA]) SF methods
We also calculated SFs using an OLS and MA fit of weighted modeled enhancements (y) against weighted measured enhancements (x) across a flight.The slope of this fit is the reciprocal of the SF for this approach.This was done with the intercept forced to zero as this can partially mitigate the impact of poor correlation on the slope (see SI Section S2).

Bayes (B and 2B) SF methods
Lastly, we calculated SFs using 2 different approaches that rely on Bayes's Theorem, described more thoroughly in SI section S3.Broadly, these involve minimizing the differences between measured and modeled enhancements, similar to regression approaches, but including a regularization term that considers the uncertainties and prior knowledge of the SF, similar to an inversion.Optimum SFs were obtained by minimizing the cost function J (Enting, 2002;Tarantola, 2005): where l is the state vector (i.e., SF), l b is the first guess or a priori state vector, P b is the a priori error variance matrix, which represents the uncertainties in our a priori knowledge about the state vector, and R is the model-data error covariance matrix, which represents the uncertainties in the modeled enhancements H(l) and the observations y, also known as model-data mismatch.In our case, R is a diagonal covariance matrix that includes background uncertainty, measurement uncertainty, and transport modeling uncertainty added in quadrature.
The first approach (B) optimizes a single SF using the analytical solution to Equation 3. In the second approach (2B), 3 parameters are simultaneously optimized in the same manner.In this case, the first term is the SF, the second term is an offset to account for potential background bias (i.e., intercept), and the third term is the slope of a linear trend to account for spatial gradients in the background.In this work, we chose as prior SFs l b ¼ [1 0 0] with prior uncertainties P b ¼ [1 1 0.1] for 2B and for B we use

Results and discussion
We report posterior anthropogenic emission rates based on the range of results across the various MET, SF methods, flight days, and priors using the standard deviation across posteriors, one category at a time.These provide information about the variability in posterior emission rates introduced by each of these aspects of the ensemble.We then present campaign-averaged emission rates for both AOIs as NYC represents a more policy-relevant area, while the wider GNA better represents the area of sensitivity for our flights.The campaign average, representative of daytime emissions during our flight days, is the average posterior for the AOI across all 160 ensemble members and 9 flights (for a total of 1,440 calculated posterior emission rates).
In this work, the biogenic fluxes from VPRM are used solely to ensure the best possible comparison between measured and modeled enhancements.The VPRM modeled emissions are generally small and, after background subtraction, have little effect on the modeled enhancements.However, they have a more significant effect on the weighting functions as the rural areas outside of the AOIs have more significant biogenic emissions, thus affecting the ratio between AOI and non-AOI emissions (i.e., the weighting functions).This means that ignoring the biogenic contribution would have led to an overestimation of the AOIs emission rates.This can be seen in SI Figures S3 and S4.

Posterior emission rates
The posterior emission rates across the ensemble are shown in Figure 3 and summarized in Table 3. Across all parameters, the campaign-averaged posterior emission rate for NYC is (49 ± 16) kmol/s (mean ± 1s), and for the GNA, it is (144 ± 44) kmol/s, with the bounds representing the 1s variability across the 9 flights.We rely on the daily variability when discussing the campaign-averaged posterior emission rates because the daily variability incorporates both real variability in emissions and apparent variability caused by methodological sources, including those that are not captured by the ensemble spread, which are discussed in detail below.
The campaign-averaged posterior emission rates are larger than any of the prior emission rates (colored lines in Figure 3), but this is as expected given the priors are annual averages while all flights occurred during daytime hours in the nongrowing season when anthropogenic emissions tend to be larger.To more appropriately compare to the inventories, we use emission rates from the hourly ACES (year 2017) and Vulcan (year 2015) inventories for the dates and times that best represent the flights as was done in Pitt et al. (2022; abbreviated as ACI for the Anthropogenic Carbon Emissions System inventory and VUI for Vulcan in Figure 3, see SI section S4 for a detailed description).The average emission rates for these representative inventories are summarized in Table 3.For ACES, the average representative emission rates are (46 ± 9) kmol/s for NYC (7% less than our posterior estimate) and (145 ± 21) kmol/s for the GNA (<1% greater than our posterior estimate).For Vulcan, the average representative emission rates are (52 ± 10) kmol/s for NYC (6% greater than our posterior estimate) and (124 ± 20) kmol/s for the GNA (14% less than our posterior estimate).Bounds here represent the 1s variability across 45 representative days.Agreement for both AOIs is within the daily variability of the representative inventories and well within the daily variability of the posterior emission rates.This level of agreement for both AOIs, even though the SFs differ by as much as 50%, supports the argument that using weighting functions helped us to separate AOI emissions from non-AOI emissions.Although it is difficult to directly compare these emission rates to the NYC self-reported inventory (SRI), a recent study by Gurney et al. (2021) compared annual 2015 Vulcan emissions with the 2015 NYC SRI, considering only source categories for which the SRI did not include any indirect emissions (i.e., Scope 2 or 3).That study found that the total NYC CO 2 emissions over these source categories in the SRI were 19% lower than the equivalent Vulcan estimates.

Sources of variability
The variability in the posterior emission rate with respect to the prior used is 14% for NYC and 8.1% for the GNA (Figure 3A and B).This is an improvement from the variability across the prior emission rates, which show 44% variability in NYC and 35% variability in the GNA across the 4 priors, driven largely by ODIAC.If ODIAC is excluded, then the variability in the posterior emission rate drops to 6.6% for NYC and 7.9% for the GNA.This is still an improvement from the variability in the prior emission rates excluding ODIAC, which are 9.1% for NYC and 13.3% for the GNA.The variability across the transport models (MET and turbulence parameterizations) is 5.4% for NYC and 5.7% for the GNA (Figure 3C and D).Additionally, posterior emission rates using the vertical turbulence option based on Hanna (1982; HYSPLIT setting kblt ¼ 5) are consistently lower than those based on Kantha-Clayson (Kantha & Clayson, 2000; HYSPLIT setting kblt ¼ 2), with mean posterior emission rates that are, on average, 3.3% lower.This feature was seen in Pitt et al. (2022) as well.The variability across the 5 SF calculation methods is only 2.2% for NYC and 3.6% for the GNA (Figure 3E and F), making this the smallest source of variability.Finally, the daily posterior emission rate variability is 32% for NYC and 31% for the GNA (Figure 3G and H), making it a larger source of variability than all other terms combined in quadrature, that is, an estimate of the method's uncertainty (15% for NYC and 10% for the GNA).
These variabilities are summarized in Table 3 and combined to provide an estimation of the overall variability for the posterior emission rates as in Lopez-Coto et al. (2020) and Pitt et al. (2022).Combining these terms assumes that they are independent sources of variability, which is unlikely to be true.Additionally, the choice of ensemble members (MET, priors, etc.) could impact the ensemble spread across these parameters.Therefore, the estimated variabilities may not represent the true uncertainties, but they do still provide an estimate of the likely variability introduced by the various model choices.If this SF approach is used over larger timescales to assess trends, then the daily variability is also a relevant part of the campaign-averaged posterior variability.This results in a combined variability more comparable to the uncertainties for individual mass balance experiment approaches (Cambaliza et al., 2014;O'Shea et al., 2014;Heimburger et al., 2017;Ren et al., 2018;Ahn et al., 2020).It is worth noting that while the calculated daily variability is large (approximately 32%), the 95%, 2-sided confidence intervals (t 1Àa=2;N À1 Â s= ffiffiffiffi ffi N p , t 0.975, 8 ¼ 2.306) of our campaign mean with 8 degrees of freedom (N À 1) are smaller (approximately 25%) and can potentially be further reduced by adding additional flights.
Real variability in the inventories themselves is significant, including a factor of !1.5 change between summer and winter and day-to-day variability of !15% (see Figure S6).This intraannual variability alone is much larger than the annual reductions that policies for many cities, including NYC, would require.The daily variability in posterior emission rates is even larger than that of the representative days in ACES (19% for NYC and 15% for the GNA) or Vulcan (19% for NYC and 16% for the GNA).Potential reasons for this increased daily variability with the SF

NYC GNA
Figure 3. Boxplots of the posterior carbon dioxide (CO 2 ) emission rates for New York City (NYC) and the Greater New York Area (GNA).Boxplots of the ensemble of posterior CO 2 emission rates for NYC and the GNA, grouped by (A and B) prior, (C and D) meteorological model outputs (MET), (E and F) scaling factor calculation method, and (G and H) flight day.Boxes indicate the 25th-75th percentile, whiskers extend to the most extreme data point within 1.5 times the interquartile range of the box, and circles mark any data outside these bounds.ACI and VUI represent the Anthropogenic Carbon Emissions System and Vulcan inventories for similar days and times as the flights (see SI Section S4 for details).The MET are abbreviated as ER for the European Centre for Medium Range Weather Forecasts Fifth Reanalysis, GF for the Global Forecast System, HR for the High Resolution Rapid Refresh, and NA for the North American Mesoscale Forecast System, while the 2 and 5 represent the Kantha-Clayson (Kantha & Clayson, 2000) and Hanna (Hanna, 1982) turbulence parameterizations, respectively.DOI: https://doi.org/10.1525/elementa.2021.00121.f3 approach are largely the same as previously discussed for inversions (Lopez-Coto et al., 2020;Pitt et al., 2022) The transport model ensemble cannot account for poor performance across the ensemble and the performance will vary by day.The daily variability of emissions is underestimated by inventories because the temporal profiles of emissions used in inventories rely on some degree of interpolation or proxy data that can smooth out some temporal variability (making it akin to a climatology rather than a weather of emissions).The daily variability for the years that we flew also may have been larger than that of the inventory years, which are 2017 for ACES and 2015 for Vulcan.Lastly, the flights cover 2 winters (November 2018-March 2020), and some interannual variability in emissions may exist.

Comparison to inversion
These flights were also analyzed using a nested Bayesian inversion with similar ensemble members in Pitt et al. (2022).The data processing (long-range background, modeling parameters, etc.) of this work matches the base case of Pitt et al. (2022).This allows for a direct comparison of the results using these 2 different techniques as summarized in Table 3.The campaign-averaged posterior emission rate calculated in this work using an SF approach is 9.1% greater than the inversion-based campaign-averaged posterior emission rate for NYC and 15% greater for the GNA.This is well within the 1s combined variability of either method and both approaches show good agreement with representative inventory estimates from ACES and Vulcan, providing confidence in both approaches and in the inventories.Daily average emission rates are also highly correlated between methods, with a Pearson correlation that is over 0.9 for both AOIs.Additionally, the relative contribution of the different sources of variability is quite similar between the methods.Interestingly, the SF approach shows more consistent relative daily variability (i.e., as a percentage) between the two AOIs than the inversion analysis, which exhibits increased relative variability for NYC.This increased relative variability for NYC is also observed in the representative inventories, although with lower values.However, in absolute terms (kmol/s), the SF approach shows larger variability for the GNA and almost the same variability for NYC.Both this and the larger campaign-averaged posterior emission rate of the SF approach are due to the different assumptions inherent to these approaches.Inversions directly optimize both AOI and non-AOI sources and can adjust the spatial distribution of the prior although they default to the prior value in areas that have no data with footprint influence.The SF approach with a weighting function applies a single SF to the prior AOI, so it assumes that the prior distribution is accurate (as does any scaling approach) and it can only separate AOI from non-AOI emissions by assuming the AOI: non-AOI emission ratio in the prior is accurate.Finally, although the ensemble constructed and methods used were different, Lopez-Coto et al. ( 2020) also saw similar component variabilities in a study using flight data in an ensemble of inversions for the Washington, DC-Baltimore metropolitan area.The authors saw 33% daily variability, 7%-15% variability across transport models (MET and turbulence parameterizations, including an experimental turbulence parameterization in HYS-PLIT), and 11% variability across priors (including a flat prior with no spatial information).

Conclusions
The SF ensemble approach used here avoids many of the issues faced when applying a mass balance method or nonweighted SF methods to calculate urban emissions, by directly relating emissions to a source area and prior using transport modeling.It is also simpler to apply than a full inverse modeling approach making it useful for broad applications.Since the approach relies on different underlying assumptions, it provides a good complement to these alternative methods and has been shown here to provide similar results to an inverse modeling study using the same data.Both approaches also agree well with inventory estimates for both AOIs.This level of agreement, especially given the GNA had SFs as much as 50% larger than NYC SFs, supports the claim that the weighting function is able to reduce the effect of non-AOI emissions.This case adds to the literature showing that CO 2 inventory estimates, as well as transport models, are generally able to represent measurements relatively well (Lauvaux et al., 2016;Sargent et al., 2018;Ahn et al., 2020;Lauvaux et al., 2020;Lopez-Coto et al., 2020).The variability due to meteorology and that due to the priors were both relatively large sources of posterior emission rate variability, which suggests that improvements in both transport modeling (both HYSPLIT and the meteorological models that feed it) and bottom-up inventories are necessary to further reduce uncertainties in quantification methods such as this.This approach also resulted in posterior emission rates that were broadly consistent across priors even though EDGAR is a 10Â coarser global product and ODIAC was initially biased quite low for the region.However, the SFs are sensitive to a number of errors as shown by the previous work (Pitt et al., 2018;Sargent et al., 2018;Karion et al., 2019) as well as here and, thus, a careful model-data screening must be done to avoid large excursions in the estimations.Examples include most SF approaches used require a high correlation between modeled and measured enhancements, any SF approach requires an accurate spatial distribution of prior emissions, there must not be significant upwind sources outside the model domain, and sources outside an AOI must be considered.The inclusion of the upwind sources in the simulation domain (i.e., using a large domain) combined with the weighting function allows this approach to appropriately differentiate between emissions from particular AOIs and emissions from farther upwind, addressing the last two concerns.This allows us to focus on emissions within political and legal boundaries, such as cities, that are of particular interest for emissions quantification and trend analysis.Additionally, the ensemble approach used here can partially minimize the influence of some of the other limitations, and thus, it is preferred as opposed to a single SF method, prior, and MET.Topdown approaches such as this allow us to compare/contrast results with developed emission inventories and quantify emissions in near-real time.Future work more thoroughly comparing this ensemble SF approach to inversion as well as mass balance methods would be of value as these top-down approaches continue to be employed and further refined.
Contributed to acquisition and proper use of inventory data: CKG, KRG, LRH, GSR.
Approved the submitted version for publication: All Authors.

Disclaimer
Certain commercial equipment, instruments, or materials are identified in this article to specify the experimental procedure adequately.Such identification is not intended to imply recommendation or endorsement by National Institute of Standards and Technology nor is it intended to imply that the materials or equipment identified are necessarily the best available for the purpose.

Figure 1 .
Figure 1.Flight tracks and carbon dioxide (CO 2 ) priors.(A and B) All flight tracks overlaid on a state map (U.S. Census Bureau, 2021a) with the areas of interest highlighted in green, (A) highlighting the Greater New York Area (GNA) and (B) highlighting New York City.November 9, 2018, March 1, 2019, and March 27, 2019, had similar downwind passes over the Hudson River and thus cannot be distinguished.Winds were from the east on November 9, 2018, March 1, 2019, and March 27, 2019, west on November 15, 2019, and March 4, 2020, southwest on November 10, 2019, and February 16, and north on March 26, 2019, and March 7, 2020.(C-F) Each CO 2 prior across the modeling domain overlaid on a state map, with the GNA outlined in blue.All priors are in log 10 (kg C/km 2 h) with the color scale saturated at log 10 (0) kg C/km 2 hr.Water areas differ as Vulcan and the Anthropogenic Carbon Emissions System are national and thus do not extend beyond the coast, Emission Database for Global Atmospheric Research is global and includes marine emissions, and Open-Source Data Inventory for Anthropogenic CO 2 does not include emissions over water.DOI: https:// doi.org/10.1525/elementa.2021.00121.f1

Figure 2 .
Figure 2. Example time series of measured carbon dioxide (CO 2 ) enhancements, modeled CO 2 enhancements, and weighting functions.Measured and modeled CO 2 enhancements over time for the flights on (A and B) March 1, 2019, (C and D) March 27, 2019, and (E and F) November 10, 2019, using global forecast system modeled winds and the Vulcan prior.(A, C, and E) Use New York City (NYC) as the area of interest (AOI) and (B, D, and F) use the Greater New York Area.Each valley-peak-valley represents a downwind transect covering the extent of the plume.Upwind data are included for the flights with upwind transects, typically with near 0 weighting for NYC.The shaded regions represent the modeled enhancements that originated from the AOI, red lines represent the modeled enhancements from all upwind sources, and the green dashed lines represent the weighting functions.The flight on November 10, 2019, has smaller weighting and a larger mismatch between measured and modeled enhancements because of the SW winds mixing significant Philadelphia emissions with those from the AOI.DOI: https://doi.org/10.1525/elementa.2021.00121.f2