Urban areas are the major sources of greenhouse gas emissions but also leaders in emission reduction efforts. Appropriate techniques to quantify emissions and any potential reductions over time are necessary to effectively inform these mitigation efforts. The aircraft mass balance experiment (MBE) is an established technique used for such a purpose. In this work, we use a series of 55 MBEs downwind of power plants to assess the technique’s bias and precision. In addition, we investigate what factors drive the absolute error, determined as the absolute difference between observed and reported emission rates, in individual experiments using multilinear regressions. Power plants are required to monitor their carbon dioxide emissions with an hourly resolution, and these publicly available reported emissions can be directly compared to the mass balance estimates as a pseudo-known release. To quantify the bias we calculated the mean error, which was 10 ± 240 Mg·h^{−1} (1σ), regressed mass balance emission rates against reported emission rates to yield a slope of 0.967 ± 0.062, and compared the sum across all mass balance emission rates, 31,000 ± 1,000 Mg·h^{−1}, to the sum across all reported emissions, 30,660 ± 740 Mg·h^{−1}. All three of these approaches suggest no systematic bias. Then to quantify the precision for individual determinations we calculated the slope of a regression between the standard deviation across repeated MBEs and the corresponding average emission rate, which is 30.7% ± 6.7%. The main drivers of the absolute error were sparse sampling of the plume, poor horizontal and vertical mixing of the plume, and smaller signal-to-noise ratios. Quantifying the capabilities of this technique provides context for previous analyses and allows stakeholders and researchers to make informed decisions when choosing quantification methods. Identifying the factors that drive the absolute error also allows us to adjust flight design to minimize it and potentially improve uncertainty estimates.

## 1. Introduction

The Biden administration has set national goals to reach a 50% reduction in greenhouse gases (GHG) by 2030 and net zero by 2050, consistent with the Paris Agreement (Whitehouse.gov, 2021). This undertaking will require efforts at multiple scales, from the national to the state to the city. Urban areas alone account for approximately 45% of carbon dioxide (CO_{2}) emissions across the contiguous United States according to the Vulcan gridded emissions data product (Gurney et al., 2020; U.S. Census Bureau, 2021), and this percentage is expected to increase in the coming years as populations continue urbanizing (International Energy Agency, 2013; United Nations Department of Economic and Social Affairs Population Division, 2019). Accordingly, many cities have passed their own legislation and goals for specific GHG reductions (C40 Cities, 2022). However, high-precision measurement techniques are necessary to quantify emissions reductions to evaluate progress on this legislation and to identify effective mitigation strategies. We have conducted a large series of flights to assess the aircraft mass balance experiment’s (MBE) ability to address this need. This technique has been widely used in point source (cf. Carotenuto et al., 2018; Guha et al., 2020; and many others), regional (cf. Alfieri et al., 2010; Karion et al., 2015; and many others), and urban emission quantification studies (cf. Gioli et al., 2014; Klausner et al., 2020; and many others).

This work focuses on a data set of 55 MBEs from 26 flights at 19 different power plants conducted between 2016 and 2020, including some reanalyzed flights from previous publications (Hajny et al., 2019). This work takes advantage of this large data set to assess the aircraft MBE technique given the availability of power plant CO_{2} emissions data that are estimated using continuous emissions monitoring systems (CEMS) and reported to the Environmental Protection Agency (U.S. Environmental Protection Agency [U.S. EPA], 2022). Here we quantify the bias and precision of the method, and we investigate the flight conditions that most significantly affect the performance of the MBE approach using multilinear regressions. It is important to clarify that because these analyses are based solely on point source MBEs, the results apply at the large emissions point source scale and may not translate perfectly to area sources where estimating the background is challenging and plumes are more difficult to define. Additionally, unlike typical methane (CH_{4}) point sources, these are hot buoyant plumes from an elevated stack and so these results may not be perfectly applicable to other types of point sources. However, this is an important step in better understanding the performance of the mass balance technique, which can help lead to more general results on other types of sources in the future.

Aircraft MBE techniques vary slightly across research groups, but few have directly assessed the capabilities of their technique against known emissions. Conley et al. (2017) and Ahn et al. (2020) are two of the few studies that have directly assessed the method precision and/or bias against known emissions. Both studies used a similar approach to calculate the bias of the quantification method, relying on power plant CEMS data (U.S. EPA, 2022) as the known emission rate and, thus, have the same limitations regarding area sources with a more complex background. Erland et al. (2022) also showed that, at least for the Conley et al. (2017) and the Environment and Climate Change Canada’s top-down emission rate retrieval algorithm, if the underlying assumptions of the method are met, then the emission estimates are not overly sensitive to the choice in MBE algorithm. In our case, we used a very large data set and took the analysis a step further by examining the variables that contribute to scatter in the agreement between CEMS and MBE emission rates to assess variables that can be controlled to reduce the absolute error or to potentially help quantify it. Given the widespread application of the MBE technique, improving our understanding of these variables can provide important context for previous experiments while improving the experiment design of future works.

## 2. Materials and methods

### 2.1. Instrumentation

All flights were conducted using Purdue’s Airborne Laboratory for Atmospheric Research (ALAR; Mays et al., 2009; Cambaliza et al., 2014; Cambaliza et al., 2015; Heimburger et al., 2017; Lavoie et al., 2017; Hajny et al., 2019) which is a modified twin-engine Beechcraft Duchess with a typical airspeed of 66 m·s^{−1}. ALAR is equipped with a best air turbulence probe for 50-Hz high-precision 3D wind field measurements (Garman et al., 2006; Garman et al., 2008), a GPS/INS system reported at 50 Hz, 2 temperature probes reported at 50 Hz with measurement frequencies >1 Hz, and a Picarro cavity ring down spectrometer designed for 0.5 Hz measurements of CO_{2}, CH_{4}, and H_{2}O (Crosson, 2008). As the Picarro data are needed for analyses and have the slowest response time, all other relevant variables are downsampled to the Picarro’s frequency.

Picarro measurements are reported as dry air mole fractions, in units of micromoles per mole of dry air, or parts per million (ppm). This work focuses exclusively on the CO_{2} measurements because CEMS data do not include CH_{4} and previous analyses focused on quantifying the CH_{4} emissions from the subsets of these flights (Hajny et al., 2019). The Picarro operating range is between 0 and 1,000 ppm, with a guaranteed specifications range of 300–700 ppm and a flushing time of approximately 1.5 s (Picarro, Inc., 2021; Picarro, Inc., personal communication, 03/01/2023). Flights typically included 2 in-air three-point calibrations with National Oceanic and Atmospheric Administration-certified standard cylinders (WMO-CO_{2}-X2007) ranging from approximately 360 to approximately 450 parts per million CO_{2} (Dlugokencky et al., 2005; Tans et al., 2017; Hall et al., 2021). The cylinder values, certification dates, and so on are provided in Supporting Information (SI) Table S1. The maximum measured concentration during the MBEs ranges from 395 ppm to 940 ppm with up to 11.5% of the data for a given MBE going above the range of the three-point calibration used. Picarro has stated in personal communication that the instrument is highly linear within its 1,000 ppm operating range, so we do not treat measurements above our calibration range differently (Picarro, Inc., personal communication, 03/01/2023).

### 2.2. Power plant sampling and the MBE method

We sampled 19 power plants over the period 2016–2020, covering a range of power plant emission rates in the Eastern United States. Details for each power plant are provided in SI Table S2. This list includes power plants discussed in previous works (Lavoie et al., 2017; Hajny et al., 2019) using the same identifiers. Those works chose to keep facilities anonymous, given facility identities do not affect the interpretation or results, and as such we do the same. Flights included between 1 and 4 MBEs at progressively increased downwind distances to quantify the emission rate. MBEs that were performed back-to-back can be considered repeat experiments if the CEMS emission rates are assumed to be constant, which we confirm with CEMS data, allowing us to use them to assess the method’s precision. We note that our flight patterns are not Lagrangian as this is not only logistically impractical but also unnecessary since the emission rate is considered constant on the timescale of repeated MBE flight legs.

MBEs were typically conducted by flying at multiple altitudes, in approximately 250 ft intervals, at a fixed distance downwind of the site from as low to the ground as is safe to the highest altitude where enhanced CO_{2} concentrations could be seen, often with a vertical spiral through the center of the plume. Each MBE took an average of 42 ± 17 min to complete, varying primarily based on the interval between passes and the width and height of the plume. The flights were designed so that these passes were approximately perpendicular to the mean wind direction. The MBEs were typically within 5 km of the power plant to measure the CH_{4} as well given the much smaller CH_{4} signal, but some, particularly those with more than 2 MBEs, went farther downwind, up to 13 km. These CH_{4} measurements were the focus of previous works (Hajny et al., 2019). These downwind distances were chosen to provide an approximately Gaussian measurement and minimize sampling errors, that is, not close enough to inadequately sample the plume or risk signals above the operating range of the instrument, but not far enough to decrease the signal-to-noise (S/N) ratio too drastically.

This series of passes creates a theoretical 2D plane downwind of the power plant and we measure the flux through this plane. A conceptual diagram of this is shown in Figure 1 and an example flight is shown in Figure 2. Transects extended sufficiently beyond the plume so that CO_{2} concentrations returned to background values. An ordinary least squares (OLS) regression was fit through the background values on either end of the plume to estimate the background concentrations for each transect (Heimburger et al., 2017; Lavoie et al., 2017; Ren et al., 2018; Hajny et al., 2019; Ahn et al., 2020). This estimate of the background is, on average, within 1.5 ppm of a background estimated using upwind data given how close the upwind and downwind data are in time and space, which is quite small in comparison to the measured enhancements that are often ≥100 ppm. We choose to use a background based on an OLS regression through data on either end of the plume in the downwind transect data to be consistent with previous publications using ALAR (Heimburger et al., 2017; Lavoie et al., 2017; Hajny et al., 2019). These 0.5-Hz pointwise concentration enhancements were then converted to pointwise fluxes according to Equation 1:

Here [*C*]* _{i}* and [

*C*]

_{BG,i}represent the measured and background concentrations of CO

_{2}respectively ($molc\u22c5molair\u22121$),

*U*represents the measured horizontal wind speed (m·s

_{i}^{−1}), cos(θ)

_{i}is a unitless correction term ranging between 0 and 1 to account for cross winds with θ representing the angle between the wind direction and the transects,

*P*represents the measured pressure (atm),

_{i}*T*represents the measured temperature (Kelvin), and

_{i}*R*represents the ideal gas constant (atm·m

^{3}$molair\u22121$ K

^{−1}). The subscript

*i*indicates the point number, as the flux is calculated pointwise at 0.5 Hz using pointwise measurements for each variable. As flights were designed to be perpendicular to the predicted wind direction, the average cos(θ) term is 0.918 ± 0.074, near the perfectly perpendicular value of 1. To reduce the noise in the 50-Hz meteorological variables (temperature, pressure, and wind speed) we also apply a 10-s rolling average to these data before downsampling them to 0.5 Hz to calculate fluxes. To calculate an emission rate from these pointwise fluxes we used Equation 2:

We first interpolated the fluxes to a 100 m (horizontal) × 10 m (vertical) grid via Kriging (Chu, 2004; Cambaliza et al., 2014; Heimburger et al., 2017). The emission rate (mol·s^{−1}) was then calculated by integrating these pixels across the horizontal (−*x* to *x*) and vertical (0 to the top of the plume/convective boundary layer [CBL], *Z _{i}*) bounds of the plume. As we were unable to measure concentrations down to the surface and often saw enhancements at the lowest flown altitude, fluxes below the lowest pass had to be approximated. For 4 MBEs we were also unable to fly above the plume due to overcast conditions forcing us to approximate emissions above the plume up to the estimated cloud base, an estimate of the top of the CBL. This approach assumes that the flux through this downwind plane is approximately constant over the course of an experiment. As the sampling time is longer than the turbulence timescale and we are assuming a constant source, the composite plume created by interpolating fluxes across all transects is a representation of the average advected plume. Clearly, dense sampling of the plume will help reduce the potential sampling error in this approach, as is determined and discussed later in Section 3.2. All MBE emission rates, uncertainties, distances downwind, and other relevant data are provided in SI Table S3.

### 2.3. Uncertainties

MBE uncertainties were calculated as in Hajny et al. (2019). The uncertainties of the MBE emission rate are calculated as the uncertainties associated with the pointwise fluxes, the plume capture (i.e., extrapolation), and the Kriging interpolation combined as the square root of the sum of relative uncertainties squared, as shown in Equation 3. Only the plume capture uncertainty differs from the method described in Hajny et al. (2019):

The Kriging interpolation uncertainty is calculated as the average relative percentage difference between the results of 2 different Kriging methods (15%) for a subset of flights (Hajny et al., 2019). The plume capture uncertainty in this work is defined as the standard deviation across 3 separate extrapolation techniques relative to the average across them. Lastly, the flux uncertainty requires several steps to propagate the uncertainty in each measured quantity.

We use the average of 3 extrapolation approaches to estimate near-surface advected fluxes: Kriging with a synthetic transect (at bottom and, when needed, top), a 3-pass average, and a plume average. The synthetic transect approach involves creating a copy of the lowest flown transect at ground level before Kriging to inform the interpolation down to this altitude to avoid inappropriate extrapolation to the surface level. This can also be done at the estimated top of the boundary layer if that top altitude is not accessible by the aircraft. The 3-pass and plume average approaches take fluxes post-Kriging and define fluxes below the lowest flown transect as the average of either the lowest 3 transects or of the entire plume. The 3-pass average is relatable to the synthetic transect as both approaches use the information we have available to inform concentrations at the lowest altitudes. The plume average is similar to assuming the plume is well mixed in the CBL, which we note is unlikely given our small distance from the power plants. In SI Figure S1 we show an example set of Kriged results calculated via the 3 methods.

The uncertainty associated with incomplete plume capture was calculated as the standard deviation across these 3 approaches. MBEs where the plume was completely captured within the transects are instead reported as the emission rate without extrapolation and thus have a zero uncertainty component from incomplete plume capture. The highest flown transect, or the estimated boundary layer height if we were unable to fly above the plume, averaged 880 ± 230 m for MBEs that completely capture the plume (*N* = 9) while the lowest transect averaged 190 ± 110 m. For MBEs that did not fully capture the vertical extent of the plume (*N* = 46), the average altitudes are 990 ± 250 m and 221 ± 98 m, respectively. SI Figure S2 shows the impact that emissions below the lowest flown transect have by comparing the MBE emission rate to the same without extrapolation. Specifically, on average, the results are 28% lower than the MBE emission rates when ignoring cases with complete plume capture, implying that, on average, 28% of the flux exists between the lowest transect and the surface. However, this is not itself an uncertainty, merely the fraction of the total estimated using our approach. The uncertainty, estimated as the variability across different extrapolation techniques, is generally lower.

The flux uncertainty is the propagation of the uncertainty of each term in Equation 1, as shown in Equation 4:

The relative uncertainty in the perpendicular component of the wind speed $(\delta U,iU\u22a5,i)$ is based on the reported horizontal wind uncertainty of 0.4 m·s^{−1} from Garman (2009). We do not explicitly incorporate an uncertainty due to the cos(θ) term, indicating any nonperfect orientation of the downwind transects relative to the wind direction. The relative uncertainty in pressure $(\delta P,iPi)$ is calculated as the relative difference between the measured pressure and the barometric pressure calculated using the most recent surface pressure measurements from a nearby airport. This is biased high as the barometric law does not have to precisely agree with the measured pressure at altitude and some sites may not have an airport as nearby as others. The relative uncertainty in temperature $(\delta T,iTi)$ is calculated as the relative difference between 2 separate measurement systems on ALAR, a thermistor bead installed in the best air turbulence probe on the nose of the aircraft and a thermocouple system installed under the best air turbulence probe. For a small number of flights this was not possible as the thermistor bead was malfunctioning, so in these cases we use the campaign average instead. Both pressure and temperature uncertainties are generally quite small (<5%) relative to wind and enhancement uncertainties which will generally dominate the flux uncertainty.

The uncertainty of the background is calculated as the standard error of the linear fit used to define the background. This approach does not accurately capture the uncertainty due to the choice of background or method of defining the background, it solely describes the agreement between the fit used and the measured background data. For reference, the average absolute difference between the average upwind concentration and edge backgrounds for the 40 MBEs with significant upwind data was only 1.33 ± 0.27 ppm. This estimate is also biased high, especially for flights with multiple MBEs, as flights typically only had a single upwind pass at 1 altitude. This difference in background definition is negligible by comparison to the enhancements that were often over 100 ppm, although there are other methods one could propose to define a background value. As each transect has a separate background, this must be done separately for each transect. To convert to a pointwise background uncertainty, we then divide these transect-scale background uncertainties by the pointwise background values. Finally to convert from a relative background uncertainty to the absolute background component of the enhancement’s uncertainty we multiply by the pointwise enhancements.

The total pointwise absolute uncertainty in the enhancement is then calculated as the combination of the calibration and background uncertainties $(\delta cal2+\delta BG2)$. The uncertainty in the calibration is simply the propagated uncertainty of the slope and intercept of the calibration curve. To calculate the average relative uncertainty in the enhancement for the entire flight $(\delta cal2+\delta BG2Enhancement)$ we combine the pointwise absolute uncertainties as the square root sum of squares for each transect separately and divide by the transect’s integrated enhancement. This is converted to a flight average by squaring the relative uncertainty for each transect, averaging the relative variances together, then taking the square root of the average variance to derive a flight-averaged relative uncertainty in the enhancement, $(\delta cal2+\delta BG2Enhancement)$. To address the fact that transects with near zero enhancement would have a relative enhancement uncertainty that approaches infinity, we only consider transects that contribute at least 5% to the total enhancement for the MBE in this calculation.

### 2.4. CEMS data

The CEMS data are publicly available at an hourly resolution from the clean air markets program data (U.S. EPA, 2022), although there may be subhourly variation in these emission rates that would also affect measurements. To directly relate these data to the flights, which varied in duration and typically covered multiple hours (e.g., 11:30 AM–12:30 PM), the hourly averaged CEMS CO_{2} emission rates were combined in a time-weighted average. We also subtracted the time it took air to travel from the power plant to the MBE location, or transit time, from the actual MBE flight times. The transit time was calculated as the distance from the power plant to the plume (m) divided by the average wind speed (m·s^{−1}) during the MBE. For example, the CEMS emission rate for an MBE from 1:00 PM to 1:50 PM with a 10-min transit time would be (10 × [12:00 PM CEMS] + 40 × [1:00 PM CEMS])/50. We combine the uncertainty in the concentration and flow rate estimates in quadrature to estimate CEMS emission rate uncertainties at 14% (Peischl et al., 2010). In general, the facilities studied are all large enough so that the CEMS emission rates do not vary significantly during daytime hours, as shown for 3 example flights in SI Figure S3. We studied a wide range of power plants in this work, with CEMS emissions ranging from 77 Mg·h^{−1} to 2,260 Mg·h^{−1} for the 55 MBEs. For comparison, Ahn et al. (2020) had a range of CEMS emissions from 220 Mg·h^{−1} to 1,030 Mg·h^{−1} for their 16 MBEs, and Conley et al. (2017) had a range of CEMS emissions from 99 Mg·h^{−1} to 1,290 Mg·h^{−1} for their 5 flights.

### 2.5. Statistical methods

Given the variety of statistical methods used in this work, this section aims to describe them all in a concise, complete manner so that the remainder of the text can focus on the results of these analyses. All uncertainties in this work are presented as ±1 standard deviation. Throughout the work we round uncertainties to 2 significant digits and round measured values to the same number of decimal places as per the International Organization for Standardization recommendation (Joint Committee for Guides in Metrology, 2008). For example, as stated in the abstract, the sum across all CEMS emission rates is 30,656 ± 741 Mg·h^{−1}, which is rounded to 30,660 ± 740 Mg·h^{−1}. All of the methods discussed other than precision are considering each MBE as a separate independent estimate (*N* = 55). As discussed in more detail below, the precision is evaluated by treating back-to-back MBEs at a power plant as repeat experiments; thus, all repeat experiments in a given flight are used together to calculate a single data point for precision calculations (*N* = 22). Lastly, to define the terms as used in this work: The absolute error is estimated as the absolute difference between the MBE and CEMS emission rates given the CEMS is being treated as a known pseudo-controlled release. Uncertainty refers to either propagated quantitative uncertainties or the standard deviation of a measurement across a data set as appropriate. And precision refers to the repeatability of a technique, which we measure by treating back-to-back MBEs as repeat measurements.

To quantify the bias of the MBE method we used a York regression (York, 1968; Wehr and Saleska, 2017) between observed (MBE) and reported (CEMS) emission rates, the mean error (ME), and the difference between the summed CEMS and MBE emission rates across all flights. A York regression is a type 2 regression that accounts for uncertainties in both axes. The ME is calculated according to Equation 5:

where MBE* _{i}* is the MBE emission rate for a given experiment, CEMS

*is the corresponding CEMS emission rate, and*

_{i}*N*is the total number of MBEs.

Many flights included between 2 and 4 back-to-back MBEs at the same power plant. With verification from CEMS that the emission rate remained approximately constant across these MBEs, these can be seen as repeat experiments. Although the increasing distance from the power plant across replicate MBEs could have some impact, we saw no significant linear relationship between transit time and the absolute difference between MBE and CEMS emission rates (see SI Figure S4), which suggests this changing distance should be a small effect. Transit time was calculated as the distance from the power plant to the plume (m) divided by the average wind speed (m·s^{−1}) during the MBE. This was investigated rather than distance alone as both wind speed and distance affect plume dispersion, which would impact our S/N ratio. We calculate the precision of a single MBE as the slope of an OLS regression of the standard deviation of the emission rate across replicate MBEs against the average MBE emission rate across them, as shown in Figure 3A. For both panels of Figure 3 we are calculating an OLS regression rather than a York regression as we are using uncertainty estimates as our *y*-axes. We have no uncertainty estimates of the *y*-axes and the uncertainty in the *y* is likely significantly larger than the *x*, which is treated similar to a known standard.

We quantified the degree of change in the CEMS emission rate to verify that the CEMS emission rate is approximately constant across back-to-back MBEs for the precision calculation. We did so by calculating the percentage difference in the CEMS emission rate across the time frame of the repeat MBEs. The percentage difference is calculated as the difference in the CEMS emission rate from 1 MBE to the next divided by the average, multiplied by 100 to convert to a percentage. For the 4 flights that had 3 or 4 repeat MBEs instead of 2, the percentage difference was calculated as the standard deviation of the CEMS emission rate divided by the average CEMS emission rate across all MBEs, multiplied by 100 to convert to a percentage.

Lastly, to better understand the drivers of the absolute error, we use a multilinear regression of the absolute difference between observed (MBE) and reported (CEMS) emission rates using as covariates only quantities experimentally determined during the flights. As with the bias calculations, this absolute difference is a measure of the absolute error in the MBE emission rate given the CEMS is being treated as a known pseudo-controlled release. Every multilinear regression included an MBE emission rate term (MBE emission rate, MBE emission rate without extrapolation, MBE emission rate 1σ, or a scaled MBE term sqrt(MBE^{3}/total enhancement across all transects)), but otherwise every combination of the 33 covariates and their inverses was considered. Any multilinear fit with >4 parameters introduced additional collinearity (high correlations between parameters) had parameters that were not statistically significant, or did not improve the *R*^{2}, so only 4 parameter multilinear regressions are discussed. All 33 covariates are listed in SI Table S4, while the best performing fits and their parameters are discussed in Section 3.2.

## 3. Results and discussion

### 3.1. Bias

All of our methods for assessing the bias in the MBE method suggest that there is no systematic bias in the approach. The York regression shown in Figure 4 for the 55 MBEs has a calculated slope of 0.967 ± 0.062 with an intercept of −12 ± 12 Mg·h^{−1} with uncertainties here being calculated using a bootstrap with 10,000 runs. This slope is within 1σ of 1, suggesting that there is no statistically significant bias in the MBE results overall. We also calculated the ME to be 10 ± 240 Mg·h^{−1}, meaning it is not statistically significantly different from 0. These results are consistent with previous works using smaller data sets and somewhat different MBE methods. Ahn et al. (2020) calculated an ME of −20 ± 160 Mg·h^{−1} from 16 MBEs, and Conley et al. (2017) calculated an ME of −52 ± 75 Mg·h^{−1} from 5 flights. In this study, the CEMS emissions ranged from 77 Mg·h^{−1} to 2,260 Mg·h^{−1}; in Ahn et al. (2020), they ranged from 220 Mg·h^{−1} to 1,030 Mg·h^{−1}; and in Conley et al. (2017), they ranged from 99 Mg·h^{−1} to 1,290 Mg·h^{−1}. If we limit our study to the same ranges of CEMS emissions used in these previous works, our ME is 10 ± 190 Mg·h^{−1} and 30 ± 210 Mg·h^{−1} for the range in Ahn et al. (2020) and Conley et al. (2017), respectively. All of these works, including ours, also include individual cases with much larger differences between MBE and CEMS emission rates. Lastly, if we sum all MBE emission rates, we calculate a total of 31,000 ± 1,000 Mg·h^{−1}, and if we sum all CEMS emission rates, we calculate a total of 30,660 ± 740 Mg·h^{−1}. As these 2 values are well within their respective uncertainties, it also supports the claim that there is no statistically significant bias in the MBE results overall. As all 3 metrics indicate the MBE emission rate is unbiased, our extrapolation method of estimating near-surface advected fluxes is likely unbiased as well. With our large data set of MBEs across a variety of meteorological conditions in this work, we can further investigate the potential causes of this variability in performance.

### 3.2. Explaining variability in performance

The precision across all 22 flights with repeat MBEs (a total of 49 MBEs) is calculated here using a regression of the variability across MBEs against the average MBE emission rate as shown in Figure 3A and discussed in more detail in Section 2.5. We find the precision for individual determinations of the emission rate to be 30.7% ± 6.7% with the uncertainty here being calculated using a bootstrap with 10,000 runs. The intercept of this fit is −19% ± 27%, which is not statistically significantly different from zero. We also investigated the average ratio of the variability across MBEs and the average MBE emission rate (i.e., the average ratio between *y* and *x* values in Figure 3A). This results in a precision estimate of 25% ± 16%. These results are not statistically significantly different, but we report here the more conservative slope-based estimate given the much narrower uncertainty range. A small number of flights showed larger percentage differences in the hourly CEMS emission rates during the MBE time frame, but excluding them had little impact on the results unless using a strict percentage difference threshold, which significantly reduces the number of usable flights (precision = 30.2% ± 5.0% for *N* = 18 flights with <5% difference, precision = 21.6% ± 5.7% for *N* = 12 flights with <1% difference). To recall, the uncertainty in the CEMS estimates themselves is 14%, so restricting ourselves to flights with a percentage difference in CEMS that is much less than 14% is not necessarily meaningful.

It is clear in Figure 3B that the absolute difference between MBE and CEMS emission rates is correlated to the MBE emission rate, but there is a large degree of spread. The OLS fit has an *R*^{2} = 0.537, suggesting only approximately half of the variability in the absolute difference between MBE and CEMS emission rates can be explained by the emission rate alone. To further understand the dominant factors affecting the absolute difference between MBE and CEMS emission rates, we calculated a multilinear regression. The 4 parameter multilinear regression that performed best (as judged by *R*^{2}) while avoiding high collinearity (defined here as >0.7 correlation between individual parameters; Farrar and Glauber, 1967) and also avoiding a statistically significant intercept is shown in Equation 6:

Here *y* is the absolute difference between MBE and CEMS emission rates, *x*_{1} is the MBE emission rate calculated without extrapolating emissions to the surface, *x*_{2} is the percentage of measured data that is >1 ppm above background, *x*_{3} is the number of data points within a box encompassing the plume across the transects, and *x*_{4} is the reciprocal of the average enhancement >1 ppm above background. This combination of terms has an *R*^{2} of 0.594, no collinearity larger than 0.6, and all 4 terms (excluding the intercept) are statistically significant at the 99% level. This equation can be useful to estimate the uncertainty of a given MBE, as all terms can be directly calculated from the data.

Understanding the individual terms in this regression can potentially help us reduce uncertainty through better flight design. *x*_{1} (MBE emission rate without extrapolation) is highly correlated to the MBE emission rate and performs slightly better than the MBE emission rate when combined with other terms as the extrapolation itself requires assumptions that introduce additional variability. As shown in Figure 3B, the emission rate explains a large fraction of the variability. *x*_{2} (the percentage of measurements >1 ppm above background) and *x*_{3} (the number of data points within a box encompassing the plume across the transects) provide similar insights. The more plume data sampled, both in absolute terms (*x*_{3}) and percentage terms (*x*_{2}), the smaller the absolute difference between MBE and CEMS emission rates. These point source flights generally had large amounts of background data relative to plume data, so the percentage of measurements >1 ppm above background ranges from 2.6% to 48%. Lastly, *x*_{4} (the reciprocal of the average enhancement >1 ppm above background) tells us that the larger the average signal across the flight, the smaller the absolute difference between MBE and CEMS emission rates. This would be impacted by both larger maximum measured signals and more plume data being sampled and can be interpreted as a combination of the previous insights. The effects must be balanced as a larger emission rate results in a larger absolute (not relative) difference between MBE and CEMS emission rates, while more plume data result in a smaller absolute difference between MBE and CEMS emission rates.

However, this is only the best regression and there are several that perform similarly well that we can investigate. Figure 5 shows the parameters present in 5 similarly well-performing regressions (including the best-performing regression). Only regressions that meet the following criteria are considered: they have an *R*^{2} > 0.55, they have low collinearity (Pearson correlation < 0.7), they have an intercept that is not statistically significantly different from zero, and all 4 parameters are statistically significant at the 95% level. Additionally, 5 regressions were excluded because they included both the reciprocal of the average enhancement >1 ppm above background and 1σ of the enhancement >1 ppm above background in the same regression, giving relationships that did not make physical sense. Although the average and 1σ of the enhancement >1 ppm above background and their inverses are correlated, they are not correlated above a Pearson correlation threshold of 0.7 if only 1 of the 2 variables is inverted. We also excluded 5 regressions that included conflicting relationships, each with 2 different source sensitivity terms with opposite signs. Finally, we excluded 1 regression that met these criteria yet still included a spurious relationship that did not make physical sense and had the opposite sign as other fits (the 1σ of the enhancement >1 ppm above background). Parameters in Figure 5 are organized into groups that represent similar concepts and are highly correlated (Pearson correlation > 0.7). Although the percentage of measurements >1 ppm above background is not quite correlated with its own inverse at this 0.7 threshold, we still consider these terms as 1 group.

First, we would like to briefly discuss a few of the parameters tested that were not found in any of the best-fitting regressions shown in Figure 5. Given that, on average, 28% of the total flux across the downwind plane exists in the near surface, below flown transects, one may expect variables related to the extrapolation to be potentially significant sources of absolute error. We tested 3 parameters focused on the impact of our extrapolation: the minimum height above ground during the lowest flown transect (m), the total extrapolated flux below the lowest flown transect after interpolation (mol·m^{−2}·s^{−1}), and the maximum enhancement in the lowest flown transect (ppm). However, none of these were present in the 5 best-performing regressions. This suggests that our method of extrapolation represents the near-surface advected fluxes well enough that other sources affect the absolute error more significantly.

As mentioned in Section 2.5, every regression was calculated with 1 MBE emission rate term (MBE emission rate, MBE emission rate without extrapolation [*x*_{1}], MBE emission rate 1σ, or a scaled MBE term sqrt[MBE^{3}/total enhancement across all transects]) so this group is in all of the regressions. The group including the number of data points within a box encompassing the plume across the transects (*x*_{3}) and the reciprocal of the area of a box encompassing the plume is present in 80% of the regressions. The group of the reciprocal of the percentage of measurements >1 ppm above background (*x*_{2}) and its negative inverse are also present in 80% of regressions. The total enhancement within a box encompassing the plume relative to the total enhancement across the MBE is present in 40% of the regressions and provides another metric for the relative amount of plume data. As discussed earlier in this section, the more plume data in an MBE whether measured in absolute or relative terms, the smaller the absolute difference between MBE and CEMS emission rates. However, the area of a box encompassing the plume is not only an indirect measure of the amount of data in the plume, it is also a measure of how well mixed the plume has become as increased mixing would lead to a broader plume that’s more completely mixed vertically and is thus more likely to be sampled in any flight transects. This study has many MBEs conducted close to the facility as many were part of a previous study investigating CH_{4} emissions which had a much smaller S/N ratio (Hajny et al., 2019). The plume transit time from facility to transects ranges from 2 min to 39 min with a median of 7.5 min.

Lastly, the correlated terms of the reciprocal of the average enhancement >1 ppm above background (*x*_{4}) and the reciprocal of the standard deviation across all enhancements >1 ppm above background are present in 80% of the regressions. However, the 1 regression that does not have these terms instead has the reciprocal of the median absolute deviation across enhancements >1 ppm. Although calculated differently, this provides similar information to the regression as the reciprocal of the standard deviation across all enhancements >1 ppm above background. As we have both the average and the variability, these can be interpreted to be somewhat of a balance between the emission rate and the amount of plume data.

Given the relative importance of these terms, we can use them to make some suggestions about ideal flight plans that would minimize the potential impact on the absolute error. To summarize the previous paragraphs, from Figure 5 and Equation 6, we have identified that the absolute difference between MBE and CEMS emission rates is smaller for flights with a larger fraction of data within the plume, broader well-mixed plumes, a larger emission rate, and larger enhancements in general. As such, the highest quality MBEs should be conducted under relatively turbulent mixing conditions (high insolation) that balance the increased mixing with the decreased enhancements and prioritize dense sampling of the plume.

## 4. Conclusions

MBEs are a valuable tool for quantifying emissions, both for research and validating efforts to reduce emissions. This work adds to the existing literature (Cambaliza et al., 2014; Conley et al., 2017; Heimburger et al., 2017; Ahn et al., 2020) by assessing the bias and uncertainties of the aircraft MBE. There is not a statistically significant difference between the MBE and CEMS emission rates on average, but rather scatter about the 1:1 line. The precision of the method from multiple cases of replicate observations was 30.7%, which represents the precision for an individual MBE determination. Since there is no observed bias, we assume that the calculated precision is a reasonable estimate of the overall quantitative uncertainty in typical, individual MBE determinations. As such, this work finds no bias in the aircraft MBE method and a moderate level of uncertainty similar to other top-down atmospheric science quantification methods.

For example, Yuan et al. (2015) estimated the propagated uncertainty for eddy covariance emission estimates of CH_{4} at approximately 21% for 2 shale gas field flights. They also saw that their estimates were within uncertainty of MBE emission rates estimated using the same flight data, which had slightly higher uncertainties of approximately 35% (Peischl et al., 2015). Varon et al. (2018) estimated that satellite remote sensing approaches to quantify CH_{4} point sources would be associated with between 15% and 65% uncertainty, depending on factors such as wind speed, and Duren et al. (2019) saw an average CH_{4} point source emission rate uncertainty of 30% using an airborne remote sensing technique in California. Although these uncertainties are for a different trace gas, they are in large part caused by similar meteorological and background uncertainties and are provided here merely for comparison to our reported precision. Our estimate is also consistent with the lower limit of approximately 30% presented in Angevine et al. (2020). In their work, real flight data from 4 flights were simulated using dispersion modeling, allowing the authors to investigate some sources of uncertainty in an MBE under controlled simulated conditions. The uncertainty of longer term emission estimates could be reduced by both performing multiple MBEs and integrating with other empirical data such as site-level measurements (Heimburger et al., 2017; Lopez-Coto et al., 2020).

We have also determined several parameters that explain up to 59% of the variability in the magnitude of the absolute error for individual flights. The fraction of data within the plume in both percentage and absolute terms, the average and variability of the plume signals, the magnitude of the emission rate, and the degree of mixing all help to explain the variability in the absolute error. This information can be used to estimate the uncertainty of MBEs since all terms derive directly from measurements, and it can also inform flight design to minimize the absolute errors. Our flights covered a range of wind speeds from approximately 3 m·s^{−1} to 13 m·s^{−1}, with 1 flight having a much higher 21 m·s^{−1} wind speed, and the distance from the source ranged from 0.7 km to 13.3 km. Given these ranges, the most important parameters that are actionable suggest that the highest quality MBEs should be conducted under turbulent/convective mixing conditions (balanced by avoiding loss of S/N) such that the plumes are broader and more vertically mixed and prioritize dense sampling of the plume. Of course, this mixing must also be balanced as the average and variability of measured signals are also important covariates and increasing the mixing decreases the S/N ratio. While the outcomes of this study do not directly apply to urban areas or nonbuoyant plumes, this is an important step toward better understanding the performance of the MBE technique, and the conclusions may still help to design better flight approaches for such more complex emission sources.

## Data accessibility statement

Aircraft measurement data used in this study are available from the Harvard Dataverse at https://doi.org/10.7910/DVN/7EWN9E.

## Supplemental files

The supplemental files for this article can be found as follows:

**Text S1**. Calibration cylinder details.

**Text S2**. Defining parameters describing the facilities studied.

**Figure S1**. Example of extrapolation methods.

**Figure S2**. Impact of extrapolation on emission rate.

**Figure S3**. Example CEMS emission rates.

**Figure S4**. Relationship between MBE and CEMS plotted against transit time.

**Table S1**. Details of calibration cylinders used.

**Table S2**. Key parameters describing the facilities studied.

**Table S3**. Key parameters describing individual MBEs.

**Table S4**. All covariate descriptions.

## Acknowledgments

The authors thank air traffic control for their support with our various flight patterns and the Jonathan Amy Facility for Chemical Instrumentation at Purdue University for continued assistance in maintaining equipment in Airborne Laboratory for Atmospheric Research.

## Funding

The authors thank the Environmental Defense Fund and the National Institute of Standards and Technology (NIST; NIST Award No. 70NANB19H167) for support of this work.

## Competing interests

The authors have no competing interests, as defined by *Elementa*, that might be perceived to influence the research presented in this manuscript. Paul B. Shepson is an Associate Editor at *Elementa*. He was not involved in the review process of this article.

## Disclaimer

Certain commercial equipment, instruments, or materials are identified in this article in order to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by NIST nor is it intended to imply that the materials or equipment identified are necessarily the best available for the purpose.

## Author contributions

Contributed to flight measurements and flight design: All authors.

Contributed to analysis and interpretation of data: KDH, IL-C, PBS.

Drafted and/or revised the article: KDH, IL-C, PBS.

Approved the submitted version for publication: All authors.

## References

*Journal of Geophysical Research: Atmospheres*

*Atmospheric Environment*

*Atmospheric Chemistry and Physics*

*Elementa: Science of the Anthropocene*

*Atmospheric Chemistry and Physics*

_{2}emission strength estimation with aircraft measurements and dispersion modelling

*Environmental Monitoring and Assessment*

*Atmospheric Measurement Techniques*

*Applied Physics B*

_{4}mole fractions to a gravimetrically prepared standard scale

*Journal of Geophysical Research: Atmospheres*

*Nature*

*Atmospheric Measurement Techniques*

*The Review of Economics and Statistics*

*Journal of Atmospheric and Oceanic Technology*

*Boundary-Layer Meteorology*

_{2}emissions of Rome, Italy

*Environmental Monitoring and Assessment*

*Environmental Science & Technology*

_{2}emissions for the United States

*Journal of Geophysical Research: Atmospheres*

*Environmental Science & Technology*

_{2}calibration scale

*Atmospheric Measurement Techniques*

*Elementa: Science of the Anthropocene*

*World energy outlook 2013*

*Environmental Science & Technology*

_{2}and CH

_{4}in situ observations in summer 2018

*Elementa: Science of the Anthropocene*

*Environmental Science & Technology*

_{2}, CH

_{4}, and CO emissions estimation for the Washington, DC–Baltimore metropolitan area using an inverse modeling technique

*Environmental Science & Technology*

*Environmental Science & Technology*

*Journal of Geophysical Research: Atmospheres*

*Journal of Geophysical Research: Atmospheres*

*Journal of Geophysical Research: Atmospheres*

_{2}greenhouse gas measurements

*Atmospheric Measurement Techniques*

*World urbanization prospects: The 2018 revision*

*Atmospheric Measurement Techniques*

*Biogeosciences*

*Earth and Planetary Science Letters*

*Journal of Geophysical Research: Atmospheres*

**How to cite this article:** Hajny, KD, Lyon, DR, Armstrong, A, Floerchinger, CR, Jayarathne, T, Kaeser, R, Lavoie, T, Salmon, OE, Stirm, BH, Stuff, AA, Tomlin, JM, Wulle, B, Lopez-Coto, I, Shepson, PB. 2023. Assessing the bias and uncertainties in the aircraft mass balance technique for the determination of carbon dioxide emission rates. *Elementa: Sciences of the Anthropocene* 11(1). DOI: https://doi.org/10.1525/elementa.2022.00135

**Domain Editor-in-Chief:** Detlev Helmig, Boulder AIR LLC, Boulder, CO, USA

**Guest Editor:** Frank Flocke, NCAR Earth System Laboratory, Boulder, CO, USA

**Knowledge Domain:** Atmospheric Science