RSGISLib Classification

The classification module has functions which allow classifiers to be applied to image data, either on a per-pixel basis or following an image segmentation, with the resultant segments/clumps/objects being classified.

The classification functions are available within a number of sub-modules for interfacing with different libraries and methods:

RSGISLib Scikit-Learn Pixel Classification
RSGISLib LightGBM Pixel Classification
- Binary Classification Functions
- Multi-Class Classification Functions
RSGISLib XGBoost Classification
- Binary Classification Functions
- Multi-Class Classification Functions
RSGISLib CatBoost Image Classification
RSGISLib Keras Pixel Classification
- Training Functions
  - train_keras_pixel_classifier()
- Classify Functions
  - apply_keras_pixel_classifier()
RSGISLib Keras Image Chips Classification
- Training Functions
  - train_keras_chips_pixel_classifier()
- Classify Functions
  - apply_keras_chips_pixel_classifier()
  - train_keras_chips_ref_classifier()
RSGISLib Clumps Classification Utilities
- Populate RAT Training
  - populate_clumps_with_class_training()
- Extract Data for Training
  - extract_rat_col_data()
RSGISLib Scikit-Learn Unsupervised Pixel Classification
- Pixel Clustering
- RAT Clustering
  - cluster_sklearn_rat()
RSGISLib Imbalanced Classification Utilities

The rsgislib.classification module itself provides functions for handling training data, undertaking accuracy assessments and other useful utilities; see below.

Pixel Training Data

rsgislib.classification.get_class_training_data(img_band_info: List[ImageBandInfo], class_vec_sample_info: List[ClassVecSamplesInfoObj], tmp_dir: str, sub_sample: int = None, ref_img: str = None) → dict

A function to extract training data for vector regions for a given input image set.

Parameters:

img_band_info – A list of rsgislib.imageutils.ImageBandInfo objects to define the images and bands of interest.
class_vec_sample_info – List of rsgislib.classification.ClassVecSamplesInfoObj objects to define the training regions.
tmp_dir – A directory for temporary outputs created during the processing.
sub_sample – If not None then an integer needs to be provided which takes a random selection from the available samples to balance the number of samples used for the classification.
ref_img – A reference image which defines the area of interest, pixel size etc. for the processing. If None then an image will be generated using the input images but the tmp_dir needs to be defined.

Returns:

dictionary of ClassSimpleInfoObj objects.

rsgislib.classification.split_sample_train_valid_test(in_h5_file: str, train_h5_file: str, valid_h5_file: str, test_h5_file: str, test_sample: int, valid_sample: int, train_sample: int = None, rnd_seed: int = 42, datatype: int = None)

A function to split a HDF5 samples file into three (i.e., Training, Validation and Testing). The input HDF5 file can be created using the rsgislib.zonalstats.extract_zone_img_band_values_to_hdf function.

Parameters:

in_h5_file – Input HDF file, probably from rsgislib.zonalstats.extract_zone_img_band_values_to_hdf.
train_h5_file – Output file with the training data samples (this has the number of samples left following the removal of the test and valid samples if train_sample=None)
valid_h5_file – Output file with the valid data samples.
test_h5_file – Output file with the testing data samples.
test_sample – The size of the testing sample to be taken.
valid_sample – The size of the validation sample to be taken.
train_sample – The size of the training sample to be taken. If None then the remaining samples are returned.
rnd_seed – The random seed to be used to randomly select the sub-samples.
datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.
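
The splitting described above can be sketched in plain Python. This only illustrates the sampling idea and is not the library's implementation: the helper name split_indices is hypothetical, and it works on sample indices rather than HDF5 files.

```python
import random


def split_indices(n_samples, test_sample, valid_sample, train_sample=None, rnd_seed=42):
    """Hypothetical helper: split sample indices into train/valid/test lists."""
    if test_sample + valid_sample > n_samples:
        raise ValueError("test and valid samples exceed the samples available")
    idxs = list(range(n_samples))
    # A fixed seed makes the random selection repeatable between runs.
    random.Random(rnd_seed).shuffle(idxs)
    test = idxs[:test_sample]
    valid = idxs[test_sample:test_sample + valid_sample]
    remaining = idxs[test_sample + valid_sample:]
    if train_sample is None:
        # Mirrors train_sample=None: everything left over becomes training data.
        train = remaining
    else:
        if train_sample > len(remaining):
            raise ValueError("train sample exceeds the samples remaining")
        train = remaining[:train_sample]
    return train, valid, test


train, valid, test = split_indices(1000, test_sample=100, valid_sample=100)
print(len(train), len(valid), len(test))  # 800 100 100
```

Taking the test and validation samples first means no sample can appear in more than one output set.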

rsgislib.classification.create_train_valid_test_sets(cls_in_info: Dict[str, ClassSimpleInfoObj], cls_out_info: Dict[str, ClassInfoObj], test_sample: int, valid_sample: int, train_sample: int = None, rnd_seed: int = 42, datatype: int = None)

A function which takes a dict of rsgislib.classification.ClassSimpleInfoObj such as those retrieved from get_class_training_data and a dict of rsgislib.classification.ClassInfoObj and creates the train, test, valid data samples from a single input file for all the classes. This is a simple wrapper function around the split_sample_train_valid_test function making it easier to process multiple classes.

Parameters:

cls_in_info – a dict of rsgislib.classification.ClassSimpleInfoObj objects with the input HDF5 file paths which will be split.
cls_out_info – a dict of rsgislib.classification.ClassInfoObj objects which specifies the output paths for the output HDF5 files.
test_sample – The size of the testing sample to be taken.
valid_sample – The size of the validation sample to be taken.
train_sample – The size of the training sample to be taken. If None then the remaining samples are returned.
rnd_seed – The random seed to be used to randomly select the sub-samples.
datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.

rsgislib.classification.get_class_info_dict(cls_smpls_info: Dict[str, ClassSimpleInfoObj], smpls_dir: str) → Dict[str, ClassInfoObj]

A function which converts a dictionary of ClassSimpleInfoObj objects into a dictionary of ClassInfoObj objects. This is useful when get_class_training_data has been used to extract samples and you then want to use create_train_valid_test_sets to split the samples into training, validation and testing datasets.

Note: the output file names for the training, validation and testing datasets are defined as the basename of the input HDF5 samples file with _train, _valid or _test appended to the end.

Parameters:

cls_smpls_info – A dict with the class name as the key with a ClassSimpleInfoObj instance as the value.
smpls_dir – the file path for the training, validation and testing datasets.

Returns:

A dict with the class name as the key with a ClassInfoObj instance as the value.
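
The naming rule in the note above can be sketched as follows. This is an illustration only: the helper name split_file_names is hypothetical, and the .h5 extension on the outputs is an assumption, as the source does not state it.

```python
import os


def split_file_names(in_h5_file, smpls_dir):
    """Hypothetical helper: derive _train/_valid/_test paths from a samples file."""
    # Basename of the input file without its extension.
    base_name = os.path.splitext(os.path.basename(in_h5_file))[0]
    return {
        suffix: os.path.join(smpls_dir, f"{base_name}_{suffix}.h5")  # .h5 is assumed
        for suffix in ("train", "valid", "test")
    }


names = split_file_names("tmp/forest_smpls.h5", "smpls")
for suffix, path in names.items():
    print(suffix, path)
```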

rsgislib.classification.get_num_samples(in_h5_file: str) → int

A function to return the number of samples within the input HDF5 file.

Parameters:

in_h5_file – Input HDF file, probably from rsgislib.zonalstats.extract_zone_img_band_values_to_hdf.

Returns:

the number of samples in the HDF5 file.

rsgislib.classification.plot_train_data(cls1_h5_file: str, cls2_h5_file: str, out_plots_dir: str, cls1_name: str = 'Class 1', cls2_name: str = 'Class 2', var_names: List[str] = None)

A function which plots the training data (in HDF5 format) for two classes with histograms for the two axes. Note, this plot only works for training data extracted for pixels or clumps and not for chip-based training data.

This function uses the plotly library (https://plotly.com). It saves the plots to disk as PNGs so the plotly-orca package is also required.

Parameters:

cls1_h5_file – Input HDF5 file with the training for class 1.
cls2_h5_file – Input HDF5 file with the training for class 2.
out_plots_dir – Output directory for the plots
cls1_name – The name of the first class (Optional, default is ‘Class 1’)
cls2_name – The name of the second class (Optional, default is ‘Class 2’)
var_names – An optional list of variable names for the training data. If not provided, the variables are called ‘Var #1’, ‘Var #2’ … ‘Var #N’.

rsgislib.classification.convert_cls_smpls_to_pandas_df(h5_file: str, img_file_info: List[ImageBandInfo])

A function for converting a HDF5 file generated by extracting data for classification using rsgislib to a pandas dataframe.

Parameters:

h5_file – the path to the hdf5 file.
img_file_info – List of rsgislib.imageutils.ImageBandInfo objects which is used to provide the column names.

Returns:

pandas.DataFrame

rsgislib.classification.convert_mutli_cls_smpls_to_pandas_df(cls_smpls_info: Dict[str, ClassSimpleInfoObj], img_band_info: List[ImageBandInfo], class_id_col: str = 'class_id')

A function which takes a dict of ClassSimpleInfoObj objects and creates a pandas data frame where the classes are specified using the id within the ClassSimpleInfoObj object.

Parameters:

cls_smpls_info – dict where the class name is the key and a ClassSimpleInfoObj object is the value.
img_band_info – a list of rsgislib.imageutils.ImageBandInfo specifying the variable order within the HDF5 files. This is used to generate the column names in the dataframe.

Returns:

a pandas dataframe with the first column being the class and those after the variables.

Chips Training Data

rsgislib.classification.get_class_training_chips_data(img_band_info: List[ImageBandInfo], class_vec_sample_info: List[ClassVecSamplesInfoObj], chip_h_size: int, tmp_dir: str, ref_img: str = None) → Dict[str, ClassSimpleInfoObj]

A function to extract training chips (windows/regions) for vector regions for a given input image set.

Parameters:

img_band_info – A list of rsgislib.imageutils.ImageBandInfo objects to define the images and bands of interest.
class_vec_sample_info – A list of rsgislib.classification.ClassVecSamplesInfoObj objects to define the training regions.
chip_h_size – is half the chip size to be extracted (i.e., a value of 10 gives output image chips of 21x21: 10 pixels either side of the pixel of interest).
tmp_dir – A directory for temporary outputs created during the processing.
ref_img – A reference image which defines the area of interest, pixel size etc. for the processing. If None then an image will be generated using the input images but the tmp_dir needs to be defined.

Returns:

dictionary of ClassSimpleInfoObj objects.
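
The relationship between chip_h_size and the extracted chip can be sketched as below. The helper names are hypothetical and the window is expressed in pixel indices with an exclusive end, which is a convention chosen for this sketch.

```python
def chip_size_from_half(chip_h_size):
    """Full chip width/height: the centre pixel plus chip_h_size either side."""
    return 2 * chip_h_size + 1


def chip_window(row, col, chip_h_size):
    """Hypothetical helper: (row_start, row_end, col_start, col_end), ends exclusive."""
    return (
        row - chip_h_size,
        row + chip_h_size + 1,
        col - chip_h_size,
        col + chip_h_size + 1,
    )


print(chip_size_from_half(10))   # 21
print(chip_window(100, 200, 10))  # (90, 111, 190, 211)
```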

rsgislib.classification.split_chip_sample_train_valid_test(in_h5_file: str, train_h5_file: str, valid_h5_file: str, test_h5_file: str, test_sample: int, valid_sample: int, train_sample: int = None, rnd_seed: int = 42, datatype: int = None)

A function to split a chip HDF5 samples file (from rsgislib.zonalstats.extract_chip_zone_image_band_values_to_hdf) into three (i.e., Training, Validation and Testing).

Parameters:

in_h5_file – Input HDF file, probably from rsgislib.zonalstats.extract_chip_zone_image_band_values_to_hdf.
train_h5_file – Output file with the training data samples (this has the number of samples left following the removal of the test and valid samples if train_sample=None)
valid_h5_file – Output file with the valid data samples.
test_h5_file – Output file with the testing data samples.
test_sample – The size of the testing sample to be taken.
valid_sample – The size of the validation sample to be taken.
train_sample – The size of the training sample to be taken. If None then the remaining samples are returned.
rnd_seed – The random seed to be used to randomly select the sub-samples.
datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.

rsgislib.classification.split_chip_sample_ref_train_valid_test(in_h5_file: str, train_h5_file: str, valid_h5_file: str, test_h5_file: str, test_sample: int, valid_sample: int, train_sample: int = None, rnd_seed: int = 42, datatype: int = None)

A function to split a chip HDF5 samples file (from rsgislib.zonalstats.extract_chip_zone_image_band_values_to_hdf) into three (i.e., Training, Validation and Testing).

Parameters:

in_h5_file – Input HDF file, probably from rsgislib.zonalstats.extract_chip_zone_image_band_values_to_hdf.
train_h5_file – Output file with the training data samples (this has the number of samples left following the removal of the test and valid samples if train_sample=None)
valid_h5_file – Output file with the valid data samples.
test_h5_file – Output file with the testing data samples.
test_sample – The size of the testing sample to be taken.
valid_sample – The size of the validation sample to be taken.
train_sample – The size of the training sample to be taken. If None then the remaining samples are returned.
rnd_seed – The random seed to be used to randomly select the sub-samples.
datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.

rsgislib.classification.flip_chip_hdf5_file(input_h5_file: str, output_h5_file: str, datatype: int = None)

A function which flips each sample in both the x and y axis. So the output file will have double the number of samples as the input file.

Parameters:

input_h5_file – The input HDF5 file for chips extracted from images.
output_h5_file – The output HDF5 file for chips extracted from images.
datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.

rsgislib.classification.flip_ref_chip_hdf5_file(input_h5_file, output_h5_file, datatype=None)

A function which flips each sample in both the x and y axis. So the output file will have double the number of samples as the input file.

Parameters:

input_h5_file – The input HDF5 file for chips extracted from images.
output_h5_file – The output HDF5 file for chips extracted from images.
datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.

rsgislib.classification.label_pxl_sample_chips(sample_pxls_img: str, cls_msk_img: str, out_image: str, gdalformat: str, chip_size: int, cls_lut: Dict[int, float], sample_pxl_img_band: int = 1, cls_msk_img_band: int = 1)

A function which labels image pixels based on the proportions of a class within a chip around the pixel (can be used in combination with rsgislib.imageutils.assign_random_pxls). It is expected that this function will be used when trying to use existing maps to create deep learning chip classification training data.

Pixels are labelled if the proportion of pixels is >= the threshold provided in the LUT. If more than one class meets the threshold then the one with the highest proportion is assigned.

Parameters:

sample_pxls_img – The input binary image with the pixel locations (value == 1)
cls_msk_img – The classification image used to assign the output pixel values.
out_image – The output image. Single pixels with the class value will be outputted.
gdalformat – The output image file format.
chip_size – The size of the chip used to identify the class; this would probably correspond to the chip size being used for the deep learning classification. The area used is half the chip size around the pixel (i.e., the pixel from the samples image will be at the centre of the chip).
cls_lut – A dict look up table (LUT) with the thresholds per class for the pixel to be classified as that class.
sample_pxl_img_band – Default 1. The image band in the sample image.
cls_msk_img_band – Default 1. The image band in the class mask image.

import rsgislib.classification

sample_pxls_img = 'LS5TM_20000108_latn531lonw37_r23p204_osgb_samples.kea'
cls_msk_img = 'LS5TM_20000108_latn531lonw37_r23p204_osgb_clouds_up.kea'
output_image = 'LS5TM_20000108_latn531lonw37_r23p204_osgb_samples_lbld.kea'

# Minimum proportion of the chip each class must cover to be assigned.
cls_lut = dict()
cls_lut[1] = 0.2
cls_lut[2] = 0.2
cls_lut[3] = 0.99

rsgislib.classification.label_pxl_sample_chips(sample_pxls_img, cls_msk_img, output_image, 'KEA', 21, cls_lut)
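
The labelling rule (a class must meet its LUT threshold, and the highest proportion wins if several do) can be sketched for a single chip as below. This is an illustration of the rule as described, not the library's code; label_chip is a hypothetical helper working on a flat list of class values.

```python
from collections import Counter


def label_chip(chip_classes, cls_lut):
    """Hypothetical helper: return the class assigned to one chip, or None.

    chip_classes: flat list of class values for the pixels within the chip.
    cls_lut: dict of class value -> minimum proportion needed to assign it.
    """
    n_pxls = len(chip_classes)
    if n_pxls == 0:
        return None
    counts = Counter(chip_classes)
    best_cls, best_prop = None, 0.0
    for cls_val, threshold in cls_lut.items():
        prop = counts.get(cls_val, 0) / n_pxls
        # Keep the class only if it meets its threshold and beats the current best.
        if prop >= threshold and prop > best_prop:
            best_cls, best_prop = cls_val, prop
    return best_cls


cls_lut = {1: 0.2, 2: 0.2, 3: 0.99}
# Classes 1 and 2 both exceed 0.2; class 2 covers more of the chip.
print(label_chip([1, 1, 1, 2, 2, 2, 2, 3, 0], cls_lut))  # 2
```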

Post Classification Refinement

rsgislib.classification.fill_class_timeseries(input_imgs: List[str], out_dir: str, no_data_val: int = 0, gdalformat='KEA', out_img_ext='kea', out_file_str='_filled', double_direction: bool = True, recheck_ends: bool = True, n_iters: int = 3, cls_names_col: str = 'class_name')

A function which will fill gaps within a time series of classifications using values from the images either side of the input image. The function initially goes through the images in the order they appear in the input list and then optionally (double_direction) goes back through in the reverse direction before optionally rechecking the end images (recheck_ends). It can also be useful to apply this operation a number of times to fill all gaps, so n_iters (default 3) allows the fill to be applied iteratively.

Parameters:

input_imgs – list of input file paths
out_dir – output directory where the output images will be written
no_data_val – the no data value used within the input images (i.e., regions to be filled).
gdalformat – the output file format. (default: KEA)
out_img_ext – the output file extension (default: kea)
out_file_str – an optional addition to the end of the output file name (Default: “_filled”)
double_direction – Option to do the fill in both directions. If false then will just go through the order of the input list. (Default: True).
recheck_ends – Option to have the end images rechecked, as these are missed when they are the start image in the sequence.
n_iters – The number of iterations to be applied.
cls_names_col – If a KEA then the column name for the class column. Used for both the input and output images.
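
The forward and reverse passes can be sketched for the time series of a single pixel. This is a simplified illustration under stated assumptions: fill_series is a hypothetical helper, a gap is filled from the nearest earlier image first, and the recheck_ends and n_iters behaviour is not modelled.

```python
def fill_series(values, no_data_val=0, double_direction=True):
    """Hypothetical helper: fill no-data gaps in one pixel's class time series."""
    filled = list(values)
    # Forward pass: carry the class from the previous image into a gap.
    for i in range(1, len(filled)):
        if filled[i] == no_data_val and filled[i - 1] != no_data_val:
            filled[i] = filled[i - 1]
    if double_direction:
        # Reverse pass: fill gaps the forward pass cannot reach (e.g., at the start).
        for i in range(len(filled) - 2, -1, -1):
            if filled[i] == no_data_val and filled[i + 1] != no_data_val:
                filled[i] = filled[i + 1]
    return filled


print(fill_series([0, 2, 0, 0, 3, 0]))                          # [2, 2, 2, 2, 3, 3]
print(fill_series([0, 2, 0, 0, 3, 0], double_direction=False))  # [0, 2, 2, 2, 3, 3]
```

The second call shows why the reverse pass matters: without it the first image in the list keeps its gap.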

Utilities

rsgislib.classification.collapse_classes(input_img: str, output_img: str, gdalformat: str, class_col: str, class_int_col: str)

Collapses an attribute table with a large number of classified clumps (segments) to an attribute table with a single row per class (i.e., a classification rather than a segmentation).

Parameters:

input_img – is a string containing the name and path of the input file with attribute table.
output_img – is a string containing the name and path of the output file.
gdalformat – is a string with the output image format for the GDAL driver.
class_col – is a string with the name of the column with the class names - internally this will be treated as a string column even if a numerical column is specified.
class_int_col – is a string specifying the name of a column with the integer class representation. This is an optional parameter but if specified then the integer representation of the classes will be preserved.

rsgislib.classification.gen_rgb_img_from_clr_tbl(input_img: str, output_img: str, gdalformat: str)

Generates a 3 band colour image from the colour table in the input file.

Parameters:

input_img – is a string containing the name and path of the input file with attribute table.
output_img – is a string containing the name and path of the output file.
gdalformat – is a string with the output image format for the GDAL driver.

Accuracy Assessment Samples

rsgislib.classification.generate_random_accuracy_pts(input_img: str, out_vec_file: str, out_vec_lyr: str, out_format: str, rat_class_col: str, vec_class_col: str, vec_ref_col: str, num_pts: int, seed: int, del_exist_vec: bool)

Generates a set of random points for accuracy assessment.

Parameters:

input_img – is a string containing the name and path of the input image with attribute table.
out_vec_file – is a string containing the name and path of the output vector file.
out_vec_lyr – is a string containing the vector file layer name.
out_format – the output vector file format (e.g., GPKG)
rat_class_col – is a string specifying the name of the column in the image file containing the class names.
vec_class_col – is a string specifying the output column in the vector file for the classified class names.
vec_ref_col – is a string specifying an output column in the vector file which can be used in the accuracy assessment for the reference data.
num_pts – is an int specifying the total number of points which should be created.
seed – is an int specifying the seed for the random number generator. (Optional: Default 10)
del_exist_vec – is a bool, specifying whether to force removal of the output vector if it exists. (Optional: Default False)

rsgislib.classification.generate_stratified_random_accuracy_pts(input_img: str, out_vec_file: str, out_vec_lyr: str, out_format: str, rat_class_col: str, vec_class_col: str, vec_ref_col: str, num_pts: int, seed: int, del_exist_vec: bool, use_pxl_lst: bool)

Generates a set of stratified random points for accuracy assessment.

Parameters:

input_img – is a string containing the name and path of the input image with attribute table.
out_vec_file – is a string containing the name and path of the output vector file.
out_vec_lyr – is a string containing the vector file layer name.
out_format – the output vector file format (e.g., GPKG)
rat_class_col – is a string specifying the name of the column in the image file containing the class names.
vec_class_col – is a string specifying the output column in the vector file for the classified class names.
vec_ref_col – is a string specifying an output column in the vector file which can be used in the accuracy assessment for the reference data.
num_pts – is an int specifying the number of points for each class which should be created.
seed – is an int specifying the seed for the random number generator. (Optional: Default 10)
del_exist_vec – is a bool, specifying whether to force removal of the output vector if it exists. (Optional: Default False)
use_pxl_lst – is a bool, if there are only a small number of pixels then creating a list of all the pixel locations will speed up processing. (Optional: Default False)

rsgislib.classification.generate_stratified_prop_random_accuracy_pts(input_img: str, out_vec_file: str, out_vec_lyr: str, out_format: str, rat_class_col: str, vec_class_col: str, vec_ref_col: str, num_pts: int, min_num_pts: int, seed: int, del_exist_vec: bool)

Generates a set of stratified random points for accuracy assessment with the number of points per class proportional to the area mapped.

Parameters:

input_img – is a string containing the name and path of the input image with attribute table.
out_vec_file – is a string containing the name and path of the output vector file.
out_vec_lyr – is a string containing the vector file layer name.
out_format – the output vector file format (e.g., GPKG)
rat_class_col – is a string specifying the name of the column in the image file containing the class names.
vec_class_col – is a string specifying the output column in the vector file for the classified class names.
vec_ref_col – is a string specifying an output column in the vector file which can be used in the accuracy assessment for the reference data.
num_pts – is the total number of points to be created (note, with rounding this might not be the exact output).
min_num_pts – is the minimum number of points to be created for each class.
seed – is an int specifying the seed for the random number generator. (Optional: Default 10)
del_exist_vec – is a bool, specifying whether to force removal of the output vector if it exists. (Optional: Default False)
use_pxl_lst – is a bool, if there are only a small number of pixels then creating a list of all the pixel locations will speed up processing. (Optional: Default False)
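
How num_pts and min_num_pts might interact can be sketched as below. This is an assumption about the allocation, not the library's code: allocate_points is a hypothetical helper, and the class names and areas are made up for illustration.

```python
def allocate_points(cls_areas, num_pts, min_num_pts):
    """Hypothetical helper: points per class proportional to area, with a floor."""
    total_area = sum(cls_areas.values())
    return {
        cls_name: max(min_num_pts, round(num_pts * area / total_area))
        for cls_name, area in cls_areas.items()
    }


# Illustrative class areas (e.g., pixel counts); not taken from any real dataset.
alloc = allocate_points({"forest": 700, "water": 250, "urban": 50}, num_pts=200, min_num_pts=20)
print(alloc)                # {'forest': 140, 'water': 50, 'urban': 20}
print(sum(alloc.values()))  # 210
```

The total exceeds num_pts here because the smallest class is raised to the minimum, which is one reason the number of points produced may not match the number requested.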

rsgislib.classification.pop_class_info_accuracy_pts(input_img: str, vec_file: str, vec_lyr: str, rat_class_col: str, vec_class_col: str, vec_ref_col: str = None, vec_process_col: str = None)

Populates an existing set of accuracy assessment points with the class information from the input image.

Parameters:

input_img – is a string containing the name and path of the input image with attribute table.
vec_file – is a string containing the name and path of the input vector file.
vec_lyr – is a string containing the vector file layer name.
rat_class_col – is a string specifying the name of the column in the image file containing the class names.
vec_class_col – is a string specifying the output column in the vector file for the classified class names.
vec_ref_col – is an optional string specifying an output column in the vector file which can be used in the accuracy assessment for the reference data.
vec_process_col – is an optional string specifying an output column in the vector file which is used allocate points as processed or otherwise.

rsgislib.classification.create_acc_pt_sets(vec_file: str, vec_lyr: str, out_vec_file_base: str, out_vec_lyr: str, cls_col: str, n_sets: int, sets_col: str = 'set_id', out_format: str = 'GeoJSON', out_ext: str = 'geojson', shuffle_rows: bool = True, rnd_seed: int = None)

A function which splits a vector layer into n_sets where a ‘class’ column is used to ensure that there are the same number of samples per ‘class’ within each set. An example of where this function might be used is to split a set of accuracy assessment points for assessing the classification accuracy into multiple sets. Note, the output vector layers

Parameters:

vec_file – Input vector file/path
vec_lyr – Input vector layer name
out_vec_file_base – The output vector file base name and path. Note, the output file name will be: base{n_set}.out_ext. If you want a character (e.g., underscore) between the basename and the set number then include in the basename. Example, out/path/vec_file_name_
out_vec_lyr – the output vector layer name. The same layer name is used for all the output files.
cls_col – The column in the vector file which has values for the classes
n_sets – The number of sets you want the input vector sorted into.
sets_col – A column added to the output files with an integer representing the set the row belongs to so if vector files are merged again then the set information is not lost. (Default: ‘set_id’)
out_format – The output vector file format (Default: GeoJSON)
out_ext – the output vector file format extension (Default: geojson)
shuffle_rows – Boolean specifying whether the vector layer rows should be shuffled before splitting into sets (Default: True)
rnd_seed – If shuffling the rows then this random seed can be used to ensure the shuffling is the same between runs.
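
The idea of splitting rows into sets while keeping the classes evenly spread can be sketched with plain dictionaries standing in for vector features. This is an illustration only: split_into_sets is a hypothetical helper, and numbering the sets from 1 is an assumption.

```python
import random
from collections import Counter, defaultdict


def split_into_sets(rows, cls_col, n_sets, sets_col="set_id", shuffle_rows=True, rnd_seed=None):
    """Hypothetical helper: assign rows (dicts) to n_sets, balanced per class."""
    rows = list(rows)
    if shuffle_rows:
        random.Random(rnd_seed).shuffle(rows)
    # Group the rows by class so each class is dealt out separately.
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[cls_col]].append(row)
    sets = [[] for _ in range(n_sets)]
    for cls_rows in by_class.values():
        for i, row in enumerate(cls_rows):
            set_idx = i % n_sets  # deal rows round-robin across the sets
            # Numbering sets from 1 is an assumption made for this sketch.
            sets[set_idx].append({**row, sets_col: set_idx + 1})
    return sets


rows = [{"cls": "forest"}] * 6 + [{"cls": "water"}] * 4
sets = split_into_sets(rows, "cls", n_sets=2, rnd_seed=42)
for pt_set in sets:
    print(dict(Counter(row["cls"] for row in pt_set)))
```

Recording the set number on each row means the sets can be merged again later without losing track of where each row came from.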

rsgislib.classification.classaccuracymetrics.create_modelled_acc_pts(err_matrix: List[List[float]], cls_lst: List[str], n_pts: int, shuffle_pts: bool = True, rnd_seed: int = 42) → Tuple[array, array]

A function which generates a set of modelled accuracy assessment points which would produce the error matrix passed to the function. The output of this function can be used with the classaccuracymetrics.calc_class_pt_accuracy_metrics function to calculate accuracy metrics for these points.

The input error matrix is represented by n x n list of lists, where the first axis is the reference class and the second the ‘classification’.

Parameters:

err_matrix – a list of lists representing the error matrix which should be square with the same number of classes and order as the cls_lst. The error matrix should sum to 1 with the individual class values relative to the proportion of the scene and class accuracy.
cls_lst – A list of class names
n_pts – the number of output points produced
rnd_seed – a seed for the random generator which shuffles the output.

Returns:

a tuple with two numpy arrays of size n where the first is the ‘reference’ and the second is the ‘classification’.
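
Generating points from an error matrix can be sketched as below. This is a minimal illustration rather than the library's implementation: modelled_points is a hypothetical helper, it returns plain lists instead of numpy arrays, and each cell is simply rounded to a whole number of points.

```python
import random


def modelled_points(err_matrix, cls_lst, n_pts, shuffle_pts=True, rnd_seed=42):
    """Hypothetical helper: build (reference, classification) label lists."""
    ref_pts, cls_pts = [], []
    for i, ref_cls in enumerate(cls_lst):      # first axis: reference class
        for j, map_cls in enumerate(cls_lst):  # second axis: classification
            n_cell = round(err_matrix[i][j] * n_pts)
            ref_pts.extend([ref_cls] * n_cell)
            cls_pts.extend([map_cls] * n_cell)
    if shuffle_pts:
        # Shuffle the pairs together so each reference stays with its classification.
        pairs = list(zip(ref_pts, cls_pts))
        random.Random(rnd_seed).shuffle(pairs)
        ref_pts = [pair[0] for pair in pairs]
        cls_pts = [pair[1] for pair in pairs]
    return ref_pts, cls_pts


# Illustrative two-class error matrix which sums to 1.
err_matrix = [[0.45, 0.05], [0.10, 0.40]]
ref_pts, cls_pts = modelled_points(err_matrix, ["forest", "water"], n_pts=100)
print(len(ref_pts), len(cls_pts))  # 100 100
```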

rsgislib.classification.classaccuracymetrics.create_norm_modelled_err_matrix(cls_areas: List[float], ref_smpl_accs: List[List[float]]) → List[List[float]]

A function which creates a normalised error matrix (as required by create_modelled_acc_pts function) using the class areas and relative accuracies of the reference samples.

Parameters:

cls_areas – a list of relative class areas (i.e., percentage area for each class). The list must add up to either 100 or 1 (e.g., [10, 40, 30, 20] would mean that 10% of the area is mapped as class 1, 40% as class 2, 30% as class 3 and 20% as class 4).
ref_smpl_accs – The accuracy of the classes relative to the reference samples. This is an n x n square matrix where n is the number of classes. Each row is the relative accuracy of the reference samples for the class. Each row must either sum to 1 or 100.

Returns:

an n x n square matrix which is normalised for the class areas.
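
One plausible reading of the normalisation is sketched below, assuming each row of ref_smpl_accs is scaled by the relative area of the corresponding class. The source does not spell out the calculation, so norm_err_matrix is a hypothetical helper and the accuracies are illustrative values.

```python
def norm_err_matrix(cls_areas, ref_smpl_accs):
    """Hypothetical helper: scale each accuracy row by its class's relative area."""
    total_area = sum(cls_areas)
    matrix = []
    for area, row in zip(cls_areas, ref_smpl_accs):
        row_total = sum(row)  # rows may sum to 1 or 100, so normalise them
        matrix.append([(area / total_area) * (val / row_total) for val in row])
    return matrix


# Illustrative inputs: areas as percentages, accuracies as percentages per row.
cls_areas = [10, 40, 30, 20]
ref_smpl_accs = [
    [90, 5, 5, 0],
    [5, 85, 5, 5],
    [0, 10, 80, 10],
    [5, 5, 10, 80],
]
mtx = norm_err_matrix(cls_areas, ref_smpl_accs)
print(round(sum(sum(row) for row in mtx), 6))  # 1.0
```

Under this reading the matrix sums to 1 and each row sums to the relative area of its class, which matches the form of error matrix described for create_modelled_acc_pts.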

Accuracy Assessment Stats

rsgislib.classification.classaccuracymetrics.calc_acc_metrics_vecsamples(vec_file: str, vec_lyr: str, ref_col: str, cls_col: str, cls_area_dict: Dict[str, float], out_json_file: str = None, out_csv_file: str = None) → Dict

A function which calculates classification accuracy metrics using a set of reference samples in a vector file and a dictionary of class areas defining the area classified. This would often be used alongside the ClassAccuracy QGIS plugin.

Parameters:

vec_file – the input vector file with the reference points
vec_lyr – the input vector layer name with the reference points.
ref_col – the name of the reference classification column in the input vector file.
cls_col – the name of the classification column in the input vector file.
cls_area_dict – A dictionary with the class names as keys and areas as the values. These are used to normalise the accuracy metrics to the area of each class.
out_json_file – if specified the generated metrics and confusion matrix are written to a JSON file (Default=None).
out_csv_file – if specified the generated metrics and confusion matrix are written to a CSV file (Default=None).

Returns:

dict (matching JSON output) with the classification accuracy stats

Example:

import rsgislib
from rsgislib.classification import classaccuracymetrics

vec_file = "Sonoma_county_classification_refPoints.gpkg"
vec_lyr = "ref_points"
ref_col = "reference_classes"
cls_col = "classes"
out_json_file = "Sonoma_county_class_acc_metrics.json"

# Class areas keyed by class name; these names and values are placeholders.
cls_area_dict = {"class_1": 1200.0, "class_2": 300.0}

classaccuracymetrics.calc_acc_metrics_vecsamples(vec_file, vec_lyr,
                                                 ref_col, cls_col,
                                                 cls_area_dict, out_json_file)
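
What normalising accuracy by class area means can be sketched with a common area-weighted estimate of overall accuracy. The source does not state which estimator the library uses, so this is an assumption: area_weighted_overall_accuracy is a hypothetical helper and the points and areas are illustrative.

```python
def area_weighted_overall_accuracy(ref_vals, cls_vals, cls_area_dict):
    """Hypothetical helper: overall accuracy weighted by mapped class area."""
    total_area = sum(cls_area_dict.values())
    overall = 0.0
    for cls_name, area in cls_area_dict.items():
        # Points mapped as this class, and how many agree with the reference.
        n_mapped = sum(1 for c in cls_vals if c == cls_name)
        n_correct = sum(
            1 for r, c in zip(ref_vals, cls_vals) if c == cls_name and r == cls_name
        )
        if n_mapped > 0:
            overall += (area / total_area) * (n_correct / n_mapped)
    return overall


# Illustrative points: 3 of 4 'forest' points and 4 of 4 'water' points are correct.
cls_vals = ["forest"] * 4 + ["water"] * 4
ref_vals = ["forest", "forest", "forest", "water"] + ["water"] * 4
acc = area_weighted_overall_accuracy(ref_vals, cls_vals, {"forest": 900.0, "water": 100.0})
print(round(acc, 3))  # 0.775
```

The unweighted accuracy of these points is 7/8 = 0.875; weighting by area lowers it because the less accurate class covers most of the map.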

rsgislib.classification.classaccuracymetrics.calc_acc_metrics_vecsamples_img(vec_file: str, vec_lyr: str, ref_col: str, cls_col: str, cls_img: str, img_cls_name_col: str = 'ClassName', img_hist_col: str = 'Histogram', out_json_file: str = None, out_csv_file: str = None) → Dict

A function which calculates classification accuracy metrics using a set of reference samples in a vector file and the classification image defining the area classified. This would often be used alongside the ClassAccuracy QGIS plugin.

Parameters:

vec_file – the input vector file with the reference points
vec_lyr – the input vector layer name with the reference points.
ref_col – the name of the reference classification column in the input vector file.
cls_col – the name of the classification column in the input vector file.
cls_img – an image of the classification from which the area (pixel counts) of each class are extracted to normalise the confusion matrix. Should have a RAT with class names and histogram.
img_cls_name_col – The name of the column in the image attribute table which specifies the class name.
img_hist_col – The name of the column in the image attribute table which contains the histogram (i.e., number of pixels within the class).
out_json_file – if specified the generated metrics and confusion matrix are written to a JSON file (Default=None).
out_csv_file – if specified the generated metrics and confusion matrix are written to a CSV file (Default=None).

Returns:

dict (matching JSON output) with the classification accuracy stats

Example:

import rsgislib
from rsgislib.classification import classaccuracymetrics

vec_file = "Sonoma_county_classification_refPoints.gpkg"
vec_lyr = "ref_points"
ref_col = "reference_classes"
cls_col = "classes"
cls_img = "Sonoma_county_Landsat8_2015_utm_RandomForest.kea"
img_cls_name_col = "RF_classes"
img_hist_col = "Histogram"
out_json_file = "Sonoma_county_class_acc_metrics.json"

classaccuracymetrics.calc_acc_metrics_vecsamples_img(vec_file, vec_lyr,
                                                     ref_col, cls_col,
                                                     cls_img, img_cls_name_col,
                                                     img_hist_col, out_json_file)

rsgislib.classification.classaccuracymetrics.calc_acc_ptonly_metrics_vecsamples(vec_file: str, vec_lyr: str, ref_col: str, cls_col: str, out_json_file: str = None, out_csv_file: str = None) → Dict

A function which calculates classification accuracy metrics using a set of reference samples in a vector file. This would often be used alongside the ClassAccuracy QGIS plugin.

Parameters:

vec_file – the input vector file with the reference points
vec_lyr – the input vector layer name with the reference points.
ref_col – the name of the reference classification column in the input vector file.
cls_col – the name of the classification column in the input vector file.
out_json_file – if specified the generated metrics and confusion matrix are written to a JSON file (Default=None).
out_csv_file – if specified the generated metrics and confusion matrix are written to a CSV file (Default=None).

Returns:

dict (matching JSON output) with the classification accuracy stats

Example:

import rsgislib
from rsgislib.classification import classaccuracymetrics

vec_file = "sonoma_county_classification_ref_pts.gpkg"
vec_lyr = "ref_points"
ref_col = "reference_classes"
cls_col = "classes"
out_json_file = "Sonoma_county_class_acc_metrics.json"

classaccuracymetrics.calc_acc_ptonly_metrics_vecsamples(vec_file, vec_lyr,
                                                        ref_col, cls_col,
                                                        out_json_file,
                                                        out_csv_file=None)

rsgislib.classification.classaccuracymetrics.calc_acc_ptonly_metrics_vecsamples_bootstrap_conf_interval(vec_file: str, vec_lyr: str, ref_col: str, cls_col: str, out_json_file: str = None, sample_frac: float = 0.2, sample_n_smps: int = None, bootstrap_n: int = 1000) → Dict

A function which calculates confidence intervals for classification accuracy metrics using a bootstrapping approach. This function uses a set of reference samples in a vector file and would often be used alongside the ClassAccuracy QGIS plugin.

Parameters:

vec_file – the input vector file with the reference points
vec_lyr – the input vector layer name with the reference points.
ref_col – the name of the reference classification column in the input vector file.
cls_col – the name of the classification column in the input vector file.
out_json_file – if specified the generated metrics and confusion matrix are written to a JSON file (Default=None).
sample_frac – The fraction of the whole dataset selected for each bootstrap iteration. Only used if sample_n_smps is None.
sample_n_smps – Rather than a fraction of the dataset the number of samples can be specified. If None, then sample_frac will be used to calculate sample_n_smps.
bootstrap_n – The number of bootstrap iterations.

Returns:

dict with mean/median and bootstrap intervals.
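
The bootstrapping idea described above can be illustrated with a minimal, standard-library sketch. This is not the RSGISLib implementation: the helper name bootstrap_accuracy_interval, the use of overall accuracy as the metric, sampling with replacement, and the percentile-based interval are all assumptions made for illustration. Only the parameter names sample_frac, sample_n_smps and bootstrap_n are taken from the signature above.

```python
import random


def bootstrap_accuracy_interval(ref, pred, sample_frac=0.2, sample_n_smps=None,
                                bootstrap_n=1000, conf_inter=95, seed=42):
    """Hypothetical helper: percentile bootstrap interval for overall accuracy."""
    if len(ref) != len(pred):
        raise ValueError("ref and pred must be the same length")
    n_total = len(ref)
    # Assumption: the sample size is derived from the fraction when not given.
    if sample_n_smps is None:
        sample_n_smps = max(1, int(n_total * sample_frac))
    rng = random.Random(seed)
    accs = []
    for _ in range(bootstrap_n):
        # Draw point indices with replacement (assumed resampling scheme).
        idxs = [rng.randrange(n_total) for _ in range(sample_n_smps)]
        n_correct = sum(1 for i in idxs if ref[i] == pred[i])
        accs.append(n_correct / sample_n_smps)
    accs.sort()
    # Take the lower and upper percentiles of the bootstrap distribution.
    tail = (100 - conf_inter) / 200.0
    lower = accs[int(tail * (bootstrap_n - 1))]
    upper = accs[int((1.0 - tail) * (bootstrap_n - 1))]
    mean = sum(accs) / bootstrap_n
    return mean, lower, upper


# Illustrative data: 100 points with numeric class ids, 15 of them misclassified.
ref_pts = [i % 3 for i in range(100)]
pred_pts = [(c + 1) % 3 if i < 15 else c for i, c in enumerate(ref_pts)]
mean_acc, low_acc, high_acc = bootstrap_accuracy_interval(ref_pts, pred_pts)
print(mean_acc, low_acc, high_acc)
```

A fixed seed is used so that repeated runs give the same interval; drop it if independent runs are wanted.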

rsgislib.classification.classaccuracymetrics.calc_acc_ptonly_metrics_vecsamples_f1_conf_inter_sets(vec_files: List[str], vec_lyrs: List[str], ref_col: str, cls_col: str, tmp_dir: str, conf_inter: int = 95, conf_thres: float = 0.05, out_plot_file: str = None, out_stats_file: str = None, sample_frac: float = 0.2, sample_n_smps: int = None, bootstrap_n: int = 1000, shuffle_vec_file_order: bool = False, use_rand_choice: bool = False, n_choices: int = None) → (bool, int, List[float], List[float])

A function which calculates the f1-score and its confidence interval for each of the point sets provided, where the points are cumulatively combined, increasing the number of points used for the analysis. Therefore, if there were 3 files in the input list vec_files, 3 f1-scores and uncertainties would be calculated using the following point sets:

vec_files[0]
vec_files[0] + vec_files[1]
vec_files[0] + vec_files[1] + vec_files[2]

Parameters:

vec_files – list of input files which must be the same length as vec_lyrs
vec_lyrs – list of input layer names which must be the same length as vec_files
ref_col – the name of the reference classification column in the input vector file.
cls_col – the name of the classification column in the input vector file.
tmp_dir – A temporary directory where intermediate files can be written.
conf_inter – The confidence interval to be used. Options are 90, 95 or 99. The default is 95.
conf_thres – the threshold used to define whether the confidence interval is below a user threshold. Value should be between 0-1. The default is 0.05 (i.e., 5%).
out_plot_file – Optionally, a plot of the f1-scores and the upper and lower confidence intervals can be written. If None (default) then no plot will be produced. Otherwise, provide a file path and name. The file format can be PNG or PDF and is specified by the file extension of the output file.
out_stats_file – Optionally, output a JSON file with all the stats for each set, which can be used to do your own analysis of the results of the sets.
sample_frac – The fraction of the whole dataset selected for each bootstrap iteration. Only used if sample_n_smps is None.
sample_n_smps – Rather than a fraction of the dataset the number of samples can be specified. If None, then sample_frac will be used to calculate sample_n_smps.
bootstrap_n – The number of bootstrap iterations.
shuffle_vec_file_order – boolean to specify that the order of the vector files should be shuffled. (Default: False)
use_rand_choice – boolean specifying whether the order of the files should be based on a random choice, allowing the number of iterations to be different from the number of files passed to the function. (Default: False). Note that files might be used more than once.
n_choices – if use_rand_choice=True, then you can specify the number of iterations which will be used for the analysis. This might be different from the number of vector files passed.

Returns:

(bool, int, list, list): 1. whether the confidence interval fell below the confidence threshold; 2. the index at which it first fell below the threshold; 3. the list of f1-scores; and 4. the list of f1-score confidence intervals.
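
The cumulative combination of point sets described above can be sketched as follows. This is only an illustration of the ordering logic, assuming each point set has already been read into a Python list; the helper name cumulative_point_sets is hypothetical and the sketch does not read vector files or calculate f1-scores.

```python
from itertools import accumulate


def cumulative_point_sets(point_sets):
    """Hypothetical helper: return [s0, s0 + s1, s0 + s1 + s2, ...]."""
    # Each step concatenates the next set onto everything combined so far.
    return list(accumulate(point_sets, lambda combined, nxt: combined + nxt))


# Three illustrative point sets, standing in for the contents of three files.
sets_in = [["a1", "a2"], ["b1"], ["c1", "c2"]]
for n, pts in enumerate(cumulative_point_sets(sets_in)):
    print(n, len(pts))
```

Each successive analysis therefore uses at least as many points as the previous one, which is why the confidence interval is expected to narrow along the list.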

rsgislib.classification.classaccuracymetrics.summarise_multi_acc_ptonly_metrics(acc_json_files: List[str], out_acc_json_sum_file: str)

A function which takes a list of JSON files outputted from the calc_acc_ptonly_metrics_vecsamples function and creates a JSON file with summary statistics of the individual accuracy metrics. This is useful if you have calculated your accuracy using a number of individual plots and you want to compare the accuracies from the individual plots rather than just produce an overall summary.

Parameters:

acc_json_files – list of input JSON files.
out_acc_json_sum_file – file path for the output JSON file.

rsgislib.classification.classaccuracymetrics.calc_class_pt_accuracy_metrics(ref_samples: array, pred_samples: array, cls_names: array) → Dict

A function which calculates a set of classification accuracy metrics for a set of reference and predicted samples.

Parameters:

ref_samples – a 1d array of reference samples represented by a numeric class id
pred_samples – a 1d array of predicted samples represented by a numeric class id
cls_names – a 1d list of the class names (labels) in the order of the class ids.

Returns:

dict with classification accuracy metrics
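
As an illustration of the kind of metrics calculated from paired reference and predicted samples, the sketch below builds a confusion matrix and derives the overall, user's and producer's accuracies. It is a minimal standard-library example and not the RSGISLib implementation: the helper name point_accuracy_metrics, the output dict keys and the assumption that class ids run from 0 to N-1 are all hypothetical.

```python
def point_accuracy_metrics(ref_samples, pred_samples, cls_names):
    """Hypothetical helper: overall, user's and producer's accuracies."""
    if len(ref_samples) != len(pred_samples):
        raise ValueError("ref_samples and pred_samples must be the same length")
    n_cls = len(cls_names)
    # Confusion matrix: rows are predicted classes, columns are reference classes.
    # Assumption: class ids are 0..N-1, in the same order as cls_names.
    cm = [[0] * n_cls for _ in range(n_cls)]
    for ref, pred in zip(ref_samples, pred_samples):
        cm[pred][ref] += 1
    n_pts = len(ref_samples)
    overall = sum(cm[i][i] for i in range(n_cls)) / n_pts
    users = {}
    producers = {}
    for i, name in enumerate(cls_names):
        row_total = sum(cm[i])  # points predicted as class i
        col_total = sum(cm[j][i] for j in range(n_cls))  # reference points of class i
        users[name] = cm[i][i] / row_total if row_total else float("nan")
        producers[name] = cm[i][i] / col_total if col_total else float("nan")
    return {"confusion_matrix": cm, "overall": overall,
            "users": users, "producers": producers}


ref = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
pred = [0, 0, 1, 1, 1, 1, 2, 2, 2, 0]
metrics = point_accuracy_metrics(ref, pred, ["Forest", "Water", "Urban"])
print(metrics["overall"], metrics["users"], metrics["producers"])
```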

rsgislib.classification.classaccuracymetrics.calc_class_accuracy_metrics(ref_samples: array, pred_samples: array, cls_area: array, cls_names: array) → Dict

A function which calculates a set of classification accuracy metrics for a set of reference and predicted samples. The area classified for each class is used to allow further metrics to be calculated.

Parameters:

ref_samples – a 1d array of reference samples represented by a numeric class id
pred_samples – a 1d array of predicted samples represented by a numeric class id
cls_area – a 1d array with the area of each class classified (i.e., pixel count)
cls_names – a 1d list of the class names (labels) in the order of the class ids.

Returns:

dict with classification accuracy metrics
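
One common way in which class areas are used is to weight each row of the confusion matrix by the proportion of the map occupied by that class, so that the matrix cells become estimated area proportions. The sketch below shows that weighting under stated assumptions: rows are the classified (map) classes, cls_area holds one pixel count per class in row order, and the helper names are hypothetical. Whether RSGISLib applies exactly this normalisation is not confirmed by the text above.

```python
def area_weighted_matrix(cm, cls_area):
    """Hypothetical helper: convert sample counts to area proportions."""
    total_area = float(sum(cls_area))
    n_cls = len(cm)
    prop = []
    for i in range(n_cls):
        row_total = sum(cm[i])  # samples classified as class i
        weight = cls_area[i] / total_area  # share of the map classified as class i
        prop.append([weight * cm[i][j] / row_total if row_total else 0.0
                     for j in range(n_cls)])
    return prop


def area_weighted_overall_accuracy(cm, cls_area):
    """Sum of the diagonal of the area-weighted matrix."""
    prop = area_weighted_matrix(cm, cls_area)
    return sum(prop[i][i] for i in range(len(prop)))


# Rows: classified class, columns: reference class (illustrative counts).
cm_counts = [[8, 2],
             [1, 9]]
# Class 0 covers 900 pixels of the map and class 1 covers 100 pixels.
weighted_acc = area_weighted_overall_accuracy(cm_counts, [900, 100])
print(weighted_acc)
```

With these counts the unweighted overall accuracy is 17/20, whereas the area-weighted value is lower because the dominant class has the lower user's accuracy.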

rsgislib.classification.classaccuracymetrics.cls_quantity_accuracy(y_true: array, y_pred: array, cls_area: array) → Dict

A function to calculate quantity and allocation disagreement for a land cover classification. The labels must be integers from 1 to N, where N is the number of classes.

Parameters:

y_true – A list or 1D numpy array of true labels.
y_pred – A list or 1D numpy array of predicted labels.
cls_area – A dict or 1D numpy array of the area/n_pixels identified by the classifier for each class. len(cls_area) == len(numpy.unique(y_true)).

Returns:

dict with ‘Quantity Disagreement (Q)’, ‘Allocation Disagreement (A)’, ‘Proportion Correct (C)’, ‘Total Disagreement (D)’.

Reference: Pontius, R. G., Jr, & Millones, M. (2011). Death to Kappa: birth of quantity disagreement and allocation disagreement for accuracy assessment. International Journal of Remote Sensing, 32(15), 4407–4429.
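
The quantities named above can be illustrated with a short sketch following the definitions in Pontius and Millones (2011), where the total disagreement D is split into a quantity component Q and an allocation component A, with D = Q + A = 1 - C. The sketch works directly from unweighted sample counts, so it omits the cls_area weighting; the helper name disagreement_metrics and the short dict keys are hypothetical and this is not the RSGISLib implementation.

```python
def disagreement_metrics(y_true, y_pred, n_classes):
    """Hypothetical helper: Q, A, C and D from unweighted sample counts."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be the same length")
    n_pts = len(y_true)
    # counts[i][j]: samples predicted as class i+1 whose reference class is j+1.
    counts = [[0] * n_classes for _ in range(n_classes)]
    for ref, pred in zip(y_true, y_pred):
        counts[pred - 1][ref - 1] += 1
    quantity = 0
    allocation = 0
    correct = 0
    for g in range(n_classes):
        row_total = sum(counts[g])  # predicted as class g+1
        col_total = sum(counts[i][g] for i in range(n_classes))  # reference class g+1
        diag = counts[g][g]
        # Quantity: mismatch between predicted and reference class totals.
        quantity += abs(col_total - row_total)
        # Allocation: paired commission and omission errors for the class.
        allocation += 2 * min(row_total - diag, col_total - diag)
        correct += diag
    q = quantity / (2.0 * n_pts)
    a = allocation / (2.0 * n_pts)
    return {"Q": q, "A": a, "C": correct / n_pts, "D": q + a}


y_true = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
y_pred = [1, 1, 2, 2, 2, 2, 3, 3, 1, 1]
result = disagreement_metrics(y_true, y_pred, 3)
print(result)
```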

rsgislib.classification.classaccuracymetrics.calc_sampled_acc_metrics(ref_samples: array, pred_samples: array, cls_names: array, smpls_lst: List[int], out_metrics_file: str, n_repeats: int = 10, out_usr_metrics_plot: str = None, out_prod_metrics_plot: str = None, out_ref_usr_plot: str = None, out_ref_prod_plot: str = None, cls_colours: Dict[str, List[float]] = None, y_plt_usr_min: float = None, y_plt_usr_max: float = None, y_plt_prod_min: float = None, y_plt_prod_max: float = None, ref_line_clr: List = (0.0, 0.0, 0.0), in_loop: bool = False)

A function which calculates user's and producer's accuracies for the inputted reference and predicted samples by under-sampling the points (with bootstrapping) to estimate the number of points needed to get a reliable estimate for the whole population of samples.

This function was originally written alongside create_modelled_acc_pts to aid the estimation of the number of accuracy assessment points required.

Be careful not to use under-sampling values which are too small, as you may not sample all the classes and will therefore get an error.

Parameters:

ref_samples – a 1d array of reference samples represented by a numeric class id
pred_samples – a 1d array of predicted samples represented by a numeric class id
cls_names – a 1d list of the class names (labels) in the order of the class ids.
smpls_lst – list of sample sizes to use for under-sampling the input data (e.g., [400, 500, 600, 700]). A sample size cannot be more than the total number of points. Also, be careful not to use values which are too small as you may not sample all the classes.
out_metrics_file – an output json file which will have the calculated statistics for future reference.
n_repeats – the number of bootstrap repeats for the sub-sampling. This is used to calculate the 95% confidence interval for each estimate.
out_usr_metrics_plot – A file path for an optional plot of the user's accuracies. (Default: None - no plot produced).
out_prod_metrics_plot – A file path for an optional plot of the producer's accuracies. (Default: None - no plot produced).
out_ref_usr_plot – A file path for an optional plot of the user's reference accuracies. (Default: None - no plot produced).
out_ref_prod_plot – A file path for an optional plot of the producer's reference accuracies. (Default: None - no plot produced).
cls_colours – an optional dict with class colours. The key should be the class name and the value should be a list of 3 floats between 0-1 representing RGB values.
y_plt_usr_min – Optional minimum y value for the user's accuracy plot.
y_plt_usr_max – Optional maximum y value for the user's accuracy plot.
y_plt_prod_min – Optional minimum y value for the producer's accuracy plot.
y_plt_prod_max – Optional maximum y value for the producer's accuracy plot.
ref_line_clr – The colour of the reference line added to the out_usr_metrics_plot and out_prod_metrics_plot. The default is black (0.0, 0.0, 0.0).
in_loop – True if called within a loop, in which case the tqdm progress bar will be passed a position parameter of 1.
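
The warning above about small under-sampling values can be checked before running the analysis. The sketch below counts how many random sub-samples of a given size fail to include every class. It is a standard-library illustration only: the helper name count_incomplete_subsamples is hypothetical, and drawing without replacement is an assumption rather than a description of how RSGISLib sub-samples.

```python
import random


def count_incomplete_subsamples(ref_samples, n_smpls, n_repeats=10, seed=42):
    """Hypothetical helper: number of sub-samples missing at least one class."""
    if n_smpls > len(ref_samples):
        raise ValueError("n_smpls cannot exceed the total number of points")
    rng = random.Random(seed)
    all_classes = set(ref_samples)
    n_incomplete = 0
    for _ in range(n_repeats):
        # Assumption: points are drawn without replacement for each repeat.
        sub = rng.sample(list(ref_samples), n_smpls)
        if set(sub) != all_classes:
            n_incomplete += 1
    return n_incomplete


# Illustrative reference samples: class 3 is rare (5 of 1000 points).
ref_ids = [1] * 600 + [2] * 395 + [3] * 5
for n in (50, 400, 1000):
    print(n, count_incomplete_subsamples(ref_ids, n))
```

If the count is non-zero for a candidate value in smpls_lst, a larger value is likely to be a safer choice.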

Classification Utility Classes

class rsgislib.classification.ClassSimpleInfoObj(id=None, file_h5=None, red=None, green=None, blue=None)

This is a class to store the information associated with a class within the classification.

Parameters:

id – Output pixel value for this class
file_h5 – hdf5 file (from rsgislib.zonalstats.extract_zone_img_band_values_to_hdf) with the training data for the class
red – Red colour for visualisation (0-255)
green – Green colour for visualisation (0-255)
blue – Blue colour for visualisation (0-255)

class rsgislib.classification.ClassInfoObj(id=None, out_id=None, train_file_h5=None, test_file_h5=None, valid_file_h5=None, red=None, green=None, blue=None)

This is a class to store the information associated with a class within the classification.

Parameters:

id – Internal unique ID value for this class (must start at 0 and be consecutive between the classes)
out_id – External unique ID for the class which will be used as the output image pixel value, can be any integer.
train_file_h5 – hdf5 file (from rsgislib.zonalstats.extract_zone_img_band_values_to_hdf) with the training data for the class
test_file_h5 – hdf5 file (from rsgislib.zonalstats.extract_zone_img_band_values_to_hdf) with the testing data for the class
valid_file_h5 – hdf5 file (from rsgislib.zonalstats.extract_zone_img_band_values_to_hdf) with the validation data for the class
red – Red colour for visualisation (0-255)
green – Green colour for visualisation (0-255)
blue – Blue colour for visualisation (0-255)

class rsgislib.classification.ClassVecSamplesInfoObj(id=None, class_name=None, vec_file=None, vec_lyr=None, file_h5=None)

This is a class to store the information associated with the classification vector training regions.

Parameters:

id – Unique ID for the class (will probably be the pixel value for this class)
class_name – Unique name for the class.
vec_file – A vector file path with the training samples
vec_lyr – The vector layer name within the vec_file for the training samples.
file_h5 – A file path for a HDF5 file where the pixel values for these samples will be stored.

class rsgislib.classification.SamplesInfoObj(class_name=None, class_id=None, mask_img=None, mask_pxl_val=None, out_samp_img_file=None, num_samps=None, samples_h5_file=None, red=None, green=None, blue=None)

This is a class to store the information associated with a class within the classification.

Parameters:

class_name – The name of the class
class_id – Is the classification numeric ID (i.e., output pixel value)
mask_img – The input image mask from which samples are taken
mask_pxl_val – The pixel value within the mask for the class
out_samp_img_file – Temporary file which will store the sampled pixels.
num_samps – The number of samples required.
samples_h5_file – File location for the HDF5 file with the input image values for training.
red – Red colour value for visualisation.
green – Green colour value for visualisation.
blue – Blue colour value for visualisation.