Tagging Missing or Bad Data: FILLVAL and Masks

Overview

swxsoc supports tagging missing or bad data in its in-memory containers and converts those tags to and from the ISTP FILLVAL convention when writing and reading CDF files. Masks are the canonical in-memory representation of missing data. For convenience, floating-point NaN values are also treated as missing. Integer and string dtypes are never promoted to floats; their missing positions are tracked exclusively through a boolean mask.

The FILLVAL sentinel emitted for a given variable is determined by the CDF data type chosen for it; see CDF Format Guide (Section 5, Data Type Mapping) for the NumPy dtype → CDF type rules.
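As a rough orientation, the sentinels quoted in this section can be collected in a small lookup keyed by CDF type name. This is only a sketch: the dictionary name ISTP_FILLVAL is hypothetical and not part of swxsoc, the signed-integer entries assume the usual ISTP choice of the most negative representable value, and the actual NumPy dtype to CDF type mapping is the one described in the CDF Format Guide.

```python
import numpy as np

# Hypothetical lookup of FILLVAL sentinels by CDF type name (not a swxsoc API).
# Signed integers use the most negative representable value; floats and
# CDF_EPOCH use -1.0e31; TT2000 uses the int64 minimum; strings use one space.
ISTP_FILLVAL = {
    "CDF_INT1": int(np.iinfo(np.int8).min),
    "CDF_INT2": int(np.iinfo(np.int16).min),
    "CDF_INT4": int(np.iinfo(np.int32).min),
    "CDF_INT8": int(np.iinfo(np.int64).min),
    "CDF_REAL4": -1.0e31,
    "CDF_REAL8": -1.0e31,
    "CDF_EPOCH": -1.0e31,
    "CDF_TIME_TT2000": int(np.iinfo(np.int64).min),
    "CDF_CHAR": " ",
}

for cdf_type, fill in ISTP_FILLVAL.items():
    print(f"{cdf_type:16s} {fill!r}")
```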

Missing Values Across Data Types

Dtype: Float

  • In-memory fill source: NaN or .mask bit set

  • On disk: FILLVAL (e.g. -1.0e31)

  • On read result: Masked quantity; underlying value is NaN; .mask set

Dtype: Integer

  • In-memory fill source: .mask bit set, or value equal to FILLVAL sentinel

  • On disk: FILLVAL sentinel (e.g. -32768 for int16)

  • On read result: NDData/NDCube with .mask set; dtype preserved

Dtype: String (S/U)

  • In-memory fill source: .mask bit set, or value equal to b"nan" / b"NaN"

  • On disk: Single space b" "

  • On read result: NDData with .mask set; data shows b" " at masked positions

Dtype: Time (Epoch)

  • In-memory fill source: Time column with native astropy masking (t[i] = np.ma.masked), or a Masked wrapper around a Time

  • On disk: Raw TT2000 sentinel -9223372036854775808 (the masked write path always emits CDF_TIME_TT2000). The reader also recognises the EPOCH sentinel -1.0e31 in pre-existing files.

  • On read result: Time column with .mask set at sentinel positions

Floats

Float measurements may contain NaN directly, or they can be wrapped in Masked to carry an explicit boolean mask. Both are written to disk as the variable’s FILLVAL. On read, the variable comes back as a Masked quantity whose underlying numeric data has NaN at masked positions:

import numpy as np
from astropy.units import Quantity
from astropy.utils.masked import Masked

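# NaN form: the NaN position is written to disk as FILLVAL.
values = Quantity(np.array([1.0, np.nan, 3.0], dtype=np.float32), unit="m")

# Explicit-mask form: the masked position is written to disk as FILLVAL too.
masked_values = Masked(
    Quantity(np.array([1.0, 2.0, 3.0], dtype=np.float32), unit="m"),
    mask=[False, True, False],
)

# After a write/read round-trip, ``loaded`` (the variable read back from the
# CDF) is a Masked Quantity with mask=[False, True, False] and underlying
# value [1.0, nan, 3.0].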
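The float round-trip described here can be imitated with plain NumPy. The sketch below is not swxsoc's implementation: the helper names to_disk and from_disk are hypothetical, and the FILLVAL of -1.0e31 is the example value quoted in this section.

```python
import numpy as np

FILLVAL = np.float32(-1.0e31)  # example float FILLVAL quoted in this section

def to_disk(values, mask=None):
    """Hypothetical writer step: NaN or masked positions become FILLVAL."""
    values = np.asarray(values, dtype=np.float32)
    missing = np.isnan(values)
    if mask is not None:
        missing |= np.asarray(mask, dtype=bool)
    return np.where(missing, FILLVAL, values)

def from_disk(stored):
    """Hypothetical reader step: FILLVAL positions become NaN and are masked."""
    stored = np.asarray(stored, dtype=np.float32)
    mask = stored == FILLVAL
    return np.where(mask, np.float32(np.nan), stored), mask

stored = to_disk([1.0, np.nan, 3.0])
values, mask = from_disk(stored)
print(values, mask)
```

The (values, mask) pair returned by from_disk is what a Masked Quantity carries after a read: NaN in the underlying data and a set mask bit at the same position.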

Integers

Integer dtypes are never promoted to float. Missing positions must be carried as a boolean .mask on an NDData (or NDCube):

import numpy as np
from astropy.nddata import NDData

nd = NDData(
    data=np.array([10, 20, 30, 40], dtype=np.int16),
    mask=np.array([False, True, False, True]),
)

On disk, masked positions are written as the variable’s FILLVAL sentinel (chosen from the schema). On read, the dtype is preserved and the mask is restored. If a sentinel value appears in the on-disk data without an explicit mask, it is treated as fill on read.
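A minimal NumPy sketch of that integer behaviour follows. It assumes the int16 sentinel -32768 mentioned in this section and is an illustration of the convention, not the swxsoc writer or reader.

```python
import numpy as np

data = np.array([10, 20, 30, 40], dtype=np.int16)
mask = np.array([False, True, False, True])

# Assumed sentinel: the most negative int16 value (-32768).
fillval = data.dtype.type(np.iinfo(data.dtype).min)

# Write side: masked positions are replaced by the sentinel; dtype stays int16.
stored = np.where(mask, fillval, data).astype(data.dtype)

# Read side: every sentinel in the stored data is treated as fill,
# whether or not it came from an explicit mask.
restored_mask = stored == fillval

print(stored, stored.dtype, restored_mask)
```

Because the read side only sees the sentinel, a genuine measurement equal to -32768 would come back masked as well.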

Strings

String round-trips are intentionally asymmetric:

  • On write, any position equal to the literal bytes b"nan" or b"NaN" is treated as fill (this is how numpy coerces np.nan into S/U arrays) and is emitted to the CDF as a single space b" ".

  • On read, only the spec sentinel b" " is mapped to a mask bit. The literal string b"nan" is never reinterpreted on read.

A consequence: a legitimate string value of b"nan" will be coerced to fill on write. The literal three-byte string "nan" therefore cannot survive a round-trip; avoid using it as a data value.
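The asymmetry can be reproduced with plain NumPy byte strings. The helper names string_to_disk and string_from_disk below are hypothetical and only mimic the rules stated above; they are not swxsoc functions.

```python
import numpy as np

def string_to_disk(values, mask=None):
    """Hypothetical write rule: b"nan"/b"NaN" or masked positions become b" "."""
    values = np.asarray(values, dtype="S")
    missing = (values == b"nan") | (values == b"NaN")
    if mask is not None:
        missing |= np.asarray(mask, dtype=bool)
    out = values.copy()
    out[missing] = b" "
    return out

def string_from_disk(stored):
    """Hypothetical read rule: only the single-space sentinel is masked."""
    stored = np.asarray(stored, dtype="S")
    return stored, stored == b" "

# NumPy renders a float NaN as the bytes b"nan" when cast to a string dtype.
coerced = np.array([np.nan]).astype("S")

stored = string_to_disk([b"ok", b"nan", b"NaN"])
_, read_mask = string_from_disk(stored)

# A file that already contains the literal b"nan" is read back unmasked.
_, literal_mask = string_from_disk([b"nan"])
print(stored, read_mask, literal_mask)
```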

Time / Epoch

Time columns use astropy’s native masking on Time:

from astropy.time import Time
import numpy as np

t = Time(np.arange(5), format="unix")
t[2] = np.ma.masked

When a mask is present, the writer emits the Epoch variable as CDF_TIME_TT2000 and writes the raw int64 nanosecond values directly, overwriting masked positions with the ISTP sentinel -9223372036854775808. This preserves the integer sentinel exactly; the unmasked write path (which goes through Time.to_datetime()) cannot represent it. On read, the time column comes back as a Time whose .mask is set at sentinel positions; .masked is True. The reader also handles CDF_EPOCH (float64) variables, recognising the -1.0e31 sentinel.
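The sentinel handling on the masked path can be sketched with NumPy alone. The nanosecond array below is a hypothetical stand-in for the TT2000 values of the five timestamps (the real conversion is done by the writer), and the mask literal mirrors what t.mask holds after t[2] = np.ma.masked.

```python
import numpy as np

# ISTP TT2000 sentinel: the most negative int64 value.
TT2000_FILL = np.int64(np.iinfo(np.int64).min)

# Hypothetical stand-in for the raw int64 nanosecond values of five timestamps.
tt2000_ns = np.arange(5, dtype=np.int64) * 1_000_000_000

# Same pattern as t.mask once position 2 has been masked.
mask = np.array([False, False, True, False, False])

# Write side: masked positions are overwritten with the sentinel.
stored = np.where(mask, TT2000_FILL, tt2000_ns)

# Read side: sentinel positions become mask bits again.
restored_mask = stored == TT2000_FILL

print(stored.dtype, int(stored[2]), restored_mask)
```

Keeping the values as int64 throughout is what allows the sentinel to be stored exactly.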

Working with masks

After loading a CDF, you can inspect missing values through the mask uniformly across types:

loaded = swxdata.timeseries["measurement"]   # Masked Quantity if any positions were fill; otherwise a plain Quantity
mask = getattr(loaded, "mask", None)
raw = loaded.unmasked.value if mask is not None else loaded.value

loaded_support = swxdata.support["my_int"]   # NDData
mask = loaded_support.mask
raw = loaded_support.data

loaded_spectra = swxdata.spectra["my_cube"]  # NDCube
mask = loaded_spectra.mask

loaded_time = swxdata.timeseries["time"]     # Time (native mask)
mask = loaded_time.mask

Caveats

  • Integer dtypes are never promoted to float; rely on the mask to identify missing positions.

  • The string write path treats b"nan" / b"NaN" as fill regardless of any mask; the read path is strict and uses only b" ".

  • For time columns, the masked write path always emits CDF_TIME_TT2000; the historical datetime write path is unchanged when no mask is present.