Aggregate NetCDF time series data.
So… you want to aggregate time series NetCDF files?
Install the utility with with pip:
pip install ncagg
On the command line, use ncagg
:
Usage: ncagg [OPTIONS] DST [SRC]...
Aggregate NetCDF files.
Options:
-v, --version Show the version and exit.
--generate_template FILE Print the default template generated for
PATH and exit.
-u TEXT Give an Unlimited Dimension Configuration as
udim:ivar[:hz[:hz]]
-c TEXT Give an Chunksize Configuration as
udim:chunksize to chunk the ulimited
dimension udim by chunksize
-b TEXT If -u given, specify bounds for ivar as
min:max or Tstart[:[T]stop]. min and max are
numerical, otherwise T indicates start and
stop are times.start and stop are of the
form YYYY[MM[DD[HH[MM]]]] and of stop is
omitted,it will be inferred to be the least
significantly specified date + 1.
-l [DEBUG|INFO|WARNING|ERROR|CRITICAL]
log level
-t FILENAME Specify a configuration template
--help Show this message and exit.
Notes:
-u
can specify an Unlimited Dimension Configuration. See below for details.-t
). See below for details.-c
for smaller output filesize. Examples:
ncagg output_filename.nc file_0.nc file_02.nc #...
ncagg output_filename.nc path_to_files/*.nc
ncagg -u record_number:time output_filename.nc path_to_files/*.nc
ncagg -u record_number
10 output_filename.nc path_to_files/*.nc
ncagg -u record_number
0.0166666 output_filename.nc path_to_files/*.nc
ncagg -u record_number
10 -b T20170601:T20170602 output_filename.nc path_to_files/*.nc
ncagg -u record_number
10 -b T20170601 output_filename.nc path_to_files/*.nc
find /path/to/files -type f -name "*.nc" | ncagg output.nc
For more information, see the Unlimited Dimension Configuration below. Thencagg
Command Line Interface (CLI) builds a Config based on the arguments
specified. Fine grained control over the config can be exercised by providing a
config template.
Aggregation works in two stages:
The Aggregation List object is just a list that describes the order in which to combine components of an aggregation.
The objects within the list represent source files, or segments of fill values. Source files are associated
with sorting and filling instructions within the file. Fill values indicate where, and how many fill values to create.
During stage 1, the Aggregation List is generated. The level of configuration given determines how much is done here.
At most, each file is inspected according to it’s unlimited dimension and the variable that indexes it to determine
sorting and filling. No data except for index_by variables are read and none written to disk during this stage. If
an expected cadence is not provided, filling is not done. If bounds are provided, the unlimited dimension is clipped
to ensure data is included only within the bounds. For the minimum configuration given, files are simply assembed in
order of sorted filename.
During stage 2, the Aggregation List is evaluated. Evaluating the Aggregation List means simply iterating over the
components contained and copying data from these into the output aggregation file, while keeping track of global attributes.
Reasons for using this approach:
The sophistication of the aggregation is determine by how much configuration information is given on
generation of the Aggregation List.
The Config contains information that a NetCDF CDL specification would, but in json format, extended
with aggregation configuration information. If not provided, a default version will be created using the first
file in the list to aggregate.
The Config contains three properties (keys):
attributes
Each property is associated with a list of objects so to preserve ordering. The order in the
objects corresponds to the order of appearence in the output. Objects of all sections
have a “name” property.
Dimensions specify the dimensions of the file and has at minimum a “name”, and a “size”
which can be null for an unlimited dimension. Unlimited dimensions may also have an
Unlimited Dimension Configuration which will be described in a dedicated section below.
Variable objects contain a “name”, “dimensions”, “datatype”, “attributes”, and
“chunksizes”. The dimensions property is a list of dimension names on which the variable depends, each
must be configured in the dimensions section. datatype is something like int8, float32, string, etc.
Finally, attributes is another property containing key and values corresponding to variable attributes
commonly including “units”, “valid_min”, “_FillValue”, etc.
Attributes objects contain “name”, “strategy”, and optionally “value” for NetCDF Global Attributes. The
strategies are described below.
The Unlimited Dimension Configuration associates a particular unlimited dimension with a variable by which
it can be indexed. Commonly, a dimension named time is associated with a variable also named time which
indicates some epoch value for all data associated with that index of the dimension.
For example, a file may have a dimension “record_number” which is indexed by a variable “time”. Using
the Unlimited Dimension Configuration, we can specify to aggregate record_number such that the variable
“time” forms a monotonic sequence increasing at some expected frequency.
Here is what a typical GOES-R L1b product aggregation output looks like:
{
"name": "report_number",
"size": null,
"index_by": "time",
"expected_cadence": {"report_number": 1},
}
In English, the configuration above says “Order the dimension report_number by the values in the variable time, where
time values are expected to increase along the dimension report_number incrementing at 1hz.” This would be specified
to the ncagg CLI using ncagg -u report_number
.1 output.nc in1.nc in2.nc
The configuration allows to even index by multidimensional time (ehem, mag with 10 samples per report). On the command
line specified as -u report_number
, or as json:1:10
{
"name": "report_number",
"size": null,
"index_by": "OB_time",
"other_dim_indicies": {"samples_per_record": 0},
"expected_cadence": {"report_number": 1, "number_samples_per_report": 10},
}
One design constraint was to not reshape the data, so above, we order the data by looking at index 0 of
samples_per_record for every value along the report_number dimension. We assume that the other timestamps along
samples_per_record are correct. Also, given the configuration above, we only insert fill records of OB_time
if a full report_number record is missing (all 10 values along the number_samples_per_report dimension missing).
Indexing an unlimited dimension was described above. In addition to simply indexing by a variable, in the case that
the variable represents time, a common operation would be to restrict value to some range, to, for example, create
a day file. The Unlimited Dimension Configuration would look like:
{
"name": "report_number",
"size": null,
"index_by": "time",
"min": 14000000, # in units of the variable "time", expected
"max": 14000060, # something like "seconds since 2000-01-01 12:00:00"
"expected_cadence": {"report_number": 1}
}
Which would be specified on the command line as ... -u report_number
where the 1 -b1400000:14000060 ...
-b
option stands for “bounds”.
As min and max almost exclusively indicate datetime values, for convenience, they
are accepted as types: numerical, string, or python datetime. In string representation, they must start with “T” and
can be of the form “TYYYY[MM[DD[HH[MM]]]]” where brackets indicate optional and if omitted, will be inferred to be
minimum valid value, ie: 01 for MM (month). A units attribute must available for the index_by variable in the
form of “
Consider the suvi-l2-flloc (flare location) product which has two unlimited dimensions, time and feature_number.
At any time record, there can exist an arbitrary number of features. Consider a variable reporting the flux from
each feature at each time: flux(time, feature_number)
. Although feature_number is unlimited, it is unique to
each time and thus needs to be “flattened”:
flux([0], [0]) -> [[3.2e-6]]
flux([0], [0, 1]) -> [[3.3e-6, 5.4e-7]]
undesired_aggregated_flux(time, feature_number):
[[3.2e-6, _, _],
[ _, 3.3e-6, 5.4e-7]]
desired_aggregated_flux(time, feature_number):
[[3.2e-6, _],
[3.3e-6, 5.4e-7]]
The desired_aggregated_flux
is achieved by setting {“flatten”: true} within an the unlimited dimension configuration for feature_number.
[{
"name": "time",
"size": null,
"index_by": "time",
}, {
"name": "feature_number",
"size": null,
"flatten": true,
}]
The aggregated netcdf file contains global attributes formed from the constituent granules. A number of
strategies exist to aggregate Global Attributes across the granules. Most are quite self explanatory:
The configuration format expects a key “global attributes” associated with a list of objects each containing
a global attribute name, strategy, and possible value (for static). A list is used to preserve order, as the
order in the configuration will be the resulting order in the output NetCDF.
{
"global attributes": [
{
"name": "production_site",
"strategy": "unique_list"
}, {
"name": "creator",
"strategy": "static",
"value": "Stefan Codrescu"
}, {
...
}
]
}
NOT IMPLEMENTED. IN PROGRESS. SUBJECT TO CHANGE.
Consider SEIS SGPS files which contain the data from two sensor units, +X and -X. Most variables are of the form
var[record_number, sensor_unit, channel, …]. It is possible to create an aggregate file for the +X and -X sensor
units individually using the take_dim_indicies configuration key.
{
"take_dim_indicies": {
"sensor_unit": 0
}
}
With the above configuration, sensor_unit must be removed from the dimensions configuration. Please also ensure that
variables do not list sensor_unit as a dimension, and also update chunk sizes accordingly. Chunk sizes must be a list
of values of the same length as dimensions.
ncagg
can be configured to output files into a format specified by a configuration template file. It is expected
that this is a json format file. A generic template can be created using the ncagg --generate_template [SAMPLE_NC]
command. The output of the template command is the default template that is used internally if no template is specified.
Use ncagg --generate_template example_netcdf.nc > my_template.json
to save the default template for an example_netcdf.nc file
into my_template.nc. Edit my_template.json to your liking, then run aggregation using ncagg -t my_template.json [...]
.
The template syntax is verbose, but hopefully straightforward and clear. The incoming template will be validated
upon initiating an aggregation, but some issues may only be found at runtime.
The attributes section is a list of objects contianing global attributes:
The dimensions section is a list of objects containing the dimensions of the file. Most configuration options
are covered in Unlimited Dimension Configuration section, but to clarify:
Similarly, variables section is a list of objects configuring output variables. Remove the object
corresponding to some variable to remove it from the output.
Important notes:
copy_from_alt
to specify a list of alternative variables to copy data from if a variablename
isn’t found.Take care that everything is consistent when doing heavy modifications.
In addition to the CLI, ncagg
exposes an API which makes it possible to call from Python code:
from ncagg import aggregate
aggregate(["file1.nc", "file2.nc"], "output.nc")
aggregate
optionally accepts as a third argument a configuration template. If none is given,
the default template created from the first input file is used. Thus code above is equivalent to:
from ncagg import aggregate, Config
config = Config.from_nc("file1.nc")
aggregate(["file1.nc", "file2.nc"], "output.nc", config)
This allows for the possibility of programatically manipulating the configuration at runtime before
performing aggregation.
An Aggregation List is composed of two types of objects, InputFileNode and FillNode objects. These inherit in common
from an AbstractNode and must implement the get_size_along(unlimited_dim)
and data_for(var, dim)
methods. Evaluating an aggregation list is simply going though the Aggregation List and calling something like:
nc_out.variable[var][write_slice] = node.data_for(var)
The data_for
must return data consistent with the shape promised from node.get_size_along(dim)
.
The complixity of aggregation comes in handling the dimensions and building the aggregation list. In addition to
the interface exposed by an AbstractNode, each InputFileNode and FillNode implement their own specific functionality.
A FillNode is simpler, and needs to be told how many fills to insert along a certain unlimited dimension and
optionally, can be configured to return values from data_for
that are increasing along multiple dimensions
according to configured expected_cadence
values from a certain start value.
An InputFileNode is more complicated and exposes methods to find the time bounds of the file, and additionally,
is internally capable of sorting itself and inserting fill values into itself. Of course, it doesn’t modify the
actual input file, this is all done on the fly as data is being read out through data_for
. Implementation wise,
an InputFileNode may contain within itself a mini aggregation list containing two types of objects: slice and
FillNode objects. Similarly to the large scale process of aggregating, an InputFileNode returns data that has
been assembled according to it’s internal aggregation list and internal sorting.
This software is written for aggregation of GOES-R series Space Weather data products (L1b and L2+). As
such, it contains extensive tests against real GOES-16 satellite data. Many “features” in this code are
intended to address “quirks” in the ground processing (implemented by a certain contractor…).
Tests are in the test
subdirectory. Run all tests with
python -m unittest discover
The code is compatible with Python2 (2.7) and Python3, so unittests should be run with both. One interesting
thing I’ve noticed is the test suite appears to be about 20% faster in Python3 than in Python2.
Note: currently it is expected that 1 test(s) fail.
Setting up a virtualenv is recommended for development.
virtualenv venv
. venv/bin/activate
pip install --editable .
Deploy to pip, after running unittest with both with python2 and python3. The git stash
is important so that
the build is from a clean repo! We don’t want any dev or debug changes that are sitting unstaged to be included.
git stash
rm -r dist/
python setup.py bdist_wheel --universal
twine upload dist/*