cyclic_boosting.binning package#

Submodules#

cyclic_boosting.binning.bin_number_transformer module#

class cyclic_boosting.binning.bin_number_transformer.BinNumberTransformer(n_bins=100, feature_properties=None, weight_column=None, epsilon=1e-09, tolerance=0.1, inplace=False)[source]#

Bases: ECdfTransformer

This transformer bins the feature variables in X into integer bins, depending on each feature’s feature property. Features with discrete preprocessing (not continuous, but ordered or unordered) are enumerated by their unique values in ascending order; for example, a column with the values 10, 11, 12 is binned as 0, 1, 2.
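The enumeration of discrete values can be pictured in a few lines of plain Python. This is an illustrative sketch only, not the library's implementation; the helper name enumerate_unique_values is hypothetical.

```python
def enumerate_unique_values(values):
    """Map each value to the ascending rank of its unique value."""
    # Hypothetical helper for illustration; not part of cyclic_boosting.
    rank_of = {value: rank for rank, value in enumerate(sorted(set(values)))}
    return [rank_of[value] for value in values]


print(enumerate_unique_values([10, 11, 12, 11, 10]))  # [0, 1, 2, 1, 0]
```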

If no feature_properties are passed, all columns in X are treated as cyclic_boosting.flags.IS_CONTINUOUS. If a feature_properties dictionary is supplied, it must contain feature properties for each feature in X.

Not-a-number values in the input feature matrix are mapped to cyclic_boosting.binning.MISSING_VALUE_AS_BINNO in the transform step. This value can then be treated as a missing value by Cyclic Boosting.

The feature property cyclic_boosting.flags.HAS_MAGIC_INT_MISSING enables missing-value treatment for values of -999 and -9 in integer-typed feature columns (for both continuous and non-continuous features).
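The two missing-value rules above can be sketched as follows. This is a rough illustration under stated assumptions: PLACEHOLDER_MISSING_BIN stands in for cyclic_boosting.binning.MISSING_VALUE_AS_BINNO, whose actual value is not given in this document, and both helper functions are hypothetical.

```python
import math

# Placeholder only: the real value of MISSING_VALUE_AS_BINNO is not stated here.
PLACEHOLDER_MISSING_BIN = -1
# Magic integers named in the text for HAS_MAGIC_INT_MISSING.
MAGIC_INT_MISSING = (-999, -9)


def is_missing(value, has_magic_int_missing=False):
    """Return True if the value should be treated as missing (illustrative)."""
    if isinstance(value, float) and math.isnan(value):
        return True
    # The magic integers only count for integer-typed values.
    if has_magic_int_missing and isinstance(value, int):
        return value in MAGIC_INT_MISSING
    return False


def bin_or_missing(value, bin_number, has_magic_int_missing=False):
    """Replace the bin number by the placeholder for missing inputs."""
    if is_missing(value, has_magic_int_missing):
        return PLACEHOLDER_MISSING_BIN
    return bin_number
```

Without the flag, -999 and -9 are binned like any other integer in this sketch.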

Binning is performed for each feature column individually. For example, two columns with the same value range can end up with completely different bin numbers. The n_bins argument, which is typically an integer, can also be set per column by passing a dict that maps column names to the number of bins to use for continuous preprocessing.

During the fit, all features are treated in the same way as in ECdfTransformer. During the transform step, each feature value is transformed to the number of its feature bin. The range of bin numbers is:

[0, trans.bins_and_cdfs_[feature_no][2].shape[0] - 1)

For the estimated parameters see ECdfTransformer.
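The mapping from a feature value to its bin number can be sketched with a binary search over the bin boundaries. This is a simplified stand-in, not the library's code: epsilon comparisons and missing values are left out, and treating the bins as right-closed is an assumption chosen because it reproduces the output shown in the Examples for this class. The function name bin_number is hypothetical.

```python
from bisect import bisect_left

# Bin boundaries taken from the Examples of this class.
BOUNDARIES = [2.1, 2.2, 3.1, 3.7, 4.4]


def bin_number(x, boundaries):
    """Return the bin index of x, assuming right-closed bins (illustrative)."""
    n_bins = len(boundaries) - 1
    index = bisect_left(boundaries, x) - 1  # binary search
    # Clip out-of-range values into the first or last bin.
    return min(max(index, 0), n_bins - 1)


values = [1.9, 2.15, 2.4, 2.2, 3.6, 3.5, 4.3, 5.1]
print([bin_number(x, BOUNDARIES) for x in values])
# [0, 0, 1, 0, 2, 2, 3, 3]
```

With five boundaries the bin numbers run from 0 to 3, which matches the half-open range given above.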

Parameters:
  • n_bins (int or dict) – Maximum number of bins used to estimate the empirical CDF. n_bins is ignored for features with discrete preprocessing. If a dict is passed, the feature names/indices are the keys and the numbers of bins are the values. Example: {'feature a': 150, 'feature b': 20}

  • feature_properties (dict) – Dictionary listing the names of all features as keys and their preprocessing flags as values. When using a numpy feature matrix X with no column names the keys of the feature properties are the column indices.

  • weight_column (str or int) – Optional column label or column index for the weight column. If not set all samples receive the same weight 1.

  • epsilon (float) –

    Used thresholds for the comparison of float values:

    • epsilon * 1.0 for the comparison of CDF values

    • epsilon * minimal_bin_width for the comparison with bin boundaries of a given feature

    Default value for epsilon: 1e-9

  • tolerance (double) – Relative tolerance of the minimum bin weight. (E.g. if you specify 100 bins and a tolerance of 0.05 the bins are required to have only 0.95% of the total bin weights instead of 1.0%)
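The arithmetic behind the tolerance example can be written out as below. The sketch assumes that the tolerance scales the nominal share 1 / n_bins, which is how the 0.95% figure above comes about; minimum_bin_weight is a hypothetical helper, not a library function.

```python
def minimum_bin_weight(n_bins, tolerance):
    """Smallest admissible bin weight as a fraction of the total weight."""
    # Assumption: the relative tolerance scales the nominal share 1 / n_bins.
    return (1.0 - tolerance) / n_bins


share = minimum_bin_weight(n_bins=100, tolerance=0.05)
print(f"{share:.2%}")  # 0.95%
```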

Examples

>>> import numpy as np
>>> feature_1 = np.asarray([2.1, 2.2, 2.5, 3.1, 3.3, 3.7, 4.1, 4.4])
>>> X = np.c_[feature_1]
>>> from cyclic_boosting.binning import BinNumberTransformer
>>> trans = BinNumberTransformer(n_bins=4, epsilon=1e-8)
>>> trans = trans.fit(X)
>>> # only one input column
>>> column, epsilon, bins_cdfs = trans.bins_and_cdfs_[0]
>>> assert column == 0 and np.allclose(epsilon, 1e-8 * 0.1)
>>> bins_cdfs
array([[ 2.1 ,  0.  ],
       [ 2.2 ,  0.25],
       [ 3.1 ,  0.5 ],
       [ 3.7 ,  0.75],
       [ 4.4 ,  1.  ]])
>>> X_test = np.c_[[1.9, 2.15, 2.4, 2.2, 3.6, 3.5, 4.3, 5.1]]
>>> trans.transform(X_test)
array([[0],
       [0],
       [1],
       [0],
       [2],
       [2],
       [3],
       [3]], dtype=int8)
get_feature_bin_boundaries()[source]#
set_transform_request(*, X_orig: bool | None | str = '$UNCHANGED$') BinNumberTransformer#

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

X_orig (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_orig parameter in transform.

Returns:

self – The updated object.

Return type:

object

transform(X_orig: DataFrame | ndarray, y: ndarray | None = None) DataFrame | ndarray[source]#
cyclic_boosting.binning.bin_number_transformer.column_selector(X, column)[source]#

Dispatches to column selection via pandas or numpy, depending on the type of X

cyclic_boosting.binning.bin_number_transformer.column_setter(X, column, rhs)[source]#

Dispatches to column assignment via pandas or numpy, depending on the type of X

cyclic_boosting.binning.ecdf_transformer module#

class cyclic_boosting.binning.ecdf_transformer.ConstFunction(val)[source]#

Bases: object

class cyclic_boosting.binning.ecdf_transformer.ECdfTransformer(n_bins=100, feature_properties=None, weight_column=None, epsilon=1e-09, tolerance=0.1)[source]#

Bases: BaseEstimator, TransformerMixin

Transform features to the empirical CDF scale of the training data.

CDF = \(P\left(X \leq x\right)\) = cumulative distribution function. See the Wikipedia article on the cumulative distribution function.

Each feature found in feature_properties is considered separately.

In fit(), up to n_bins bin boundaries with an approximately equal number of data points per bin are determined. For discrete values, the complete CDF is stored and n_bins is ignored.

In transform(), each feature value is assigned to its bin by binary search. For features with cyclic_boosting.flags.IS_CONTINUOUS set, the empirical CDF is then interpolated between the left and right bin boundaries; out-of-range values are mapped to the CDF value of the nearest outer bin boundary. For features with cyclic_boosting.flags.IS_ORDERED or cyclic_boosting.flags.IS_UNORDERED, only values seen in the fit are transformed to their empirical CDF values; for all values that are not within epsilon of a value seen in the fit, numpy.nan is returned. Missing values (numpy.nan) stay missing and are not transformed, regardless of the feature_properties set and the feature values seen in fit(). The feature property cyclic_boosting.flags.HAS_MISSING is assumed for all features.
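The transform rules above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function names are hypothetical, the boundaries and CDF values are copied from the Examples of this class, and the epsilon handling for continuous features is omitted.

```python
import math
from bisect import bisect_left

# Bin boundaries and CDF values from the Examples of this class.
BOUNDARIES = [2.1, 2.2, 3.1, 3.7, 4.4]
CDFS = [0.0, 0.25, 0.5, 0.75, 1.0]


def ecdf_continuous(x, boundaries, cdfs):
    """Interpolate the CDF linearly between the neighbouring boundaries."""
    if math.isnan(x):
        return math.nan  # missing values stay missing
    if x <= boundaries[0]:
        return cdfs[0]  # below the range: take the lowest boundary
    if x >= boundaries[-1]:
        return cdfs[-1]  # above the range: take the highest boundary
    right = bisect_left(boundaries, x)  # binary search for the bin
    left = right - 1
    fraction = (x - boundaries[left]) / (boundaries[right] - boundaries[left])
    return cdfs[left] + fraction * (cdfs[right] - cdfs[left])


def ecdf_discrete(x, seen_values, cdfs, epsilon=1e-9):
    """Return the CDF of a value seen in the fit, otherwise NaN."""
    for value, cdf in zip(seen_values, cdfs):
        if abs(x - value) <= epsilon:
            return cdf
    return math.nan


print(round(ecdf_continuous(2.4, BOUNDARIES, CDFS), 8))  # 0.30555556
print(ecdf_discrete(7, [1, 2, 3], [0.2, 0.7, 1.0]))      # nan
```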

Parameters:
  • n_bins (int, dict) – Maximum number of bins used to estimate the empirical CDF. n_bins is ignored for features with discrete preprocessing. If a dict is passed, the feature names/indices should be the keys and the n_bins are the values. Example : {'feature a': 150, 'feature b': 20}

  • feature_properties (dict) –

    Dictionary listing the names of all features as keys and their preprocessing flags as values. When using a numpy feature matrix X with no column names the keys of the feature properties are the column indices. If no feature_properties are passed, all columns in X are treated as cyclic_boosting.flags.IS_CONTINUOUS. For more information about feature properties:

  • weight_column – Optional column label or column index for the weight column. If not set all samples receive the same weight 1.

  • epsilon (float) –

    Used thresholds for the comparison of float values:

    • epsilon * 1.0 for the comparison of CDF values

    • epsilon * minimal_bin_width for the comparison with bin boundaries of a given feature

    Default value for epsilon: 1e-9

  • tolerance (double) – Relative tolerance of the minimum bin weight. (E.g. if you specify 100 bins and a tolerance of 0.05 the bins are required to have only 0.95% of the total bin weights instead of 1.0%)

Guarantees for continuous features (cyclic_boosting.flags.IS_CONTINUOUS set for feature)

  • The estimated number of bins \(n_\text{bins\_estimated}\) is always less than or equal to the number of bins requested by the user \(n_\text{bins}\).

    \[n_\text{bins\_estimated} \leq n_\text{bins}\]
  • The bin boundaries are chosen such that each bin contains at least a fraction of \(\frac{1}{n_\text{bins}}\) of all values.

Guarantees for discrete features (cyclic_boosting.flags.IS_UNORDERED or cyclic_boosting.flags.IS_ORDERED set for feature)

  • The estimated number of bins \(n_\text{bins\_estimated}\) is equal to the number of unique values \(n_\text{unique\_values}\) found.

    \[n_\text{bins\_estimated} = n_\text{unique\_values}\]

Estimated parameters

bins_and_cdfs_#

For each feature, a tuple containing

  • the column name or index

  • the epsilon used for comparisons to bin boundaries; it is the constructor parameter epsilon multiplied by the smallest bin width

  • and the numpy.ndarray of shape (at most n_bins + 1, 2)

    This matrix contains the bin boundaries (column 0) and the corresponding cumulative probabilities (column 1) learned in the fit. For one feature x, the matrix looks like this:

    \[\begin{split}\begin{pmatrix} x_\text{min} & P\left(X < x_\text{min}\right) = 0 \\ x_\text{boundary1} & P\left(X \leq x_\text{boundary1}\right) \\ x_\text{boundary2} & P\left(X \leq x_\text{boundary2}\right) \\ \ldots & \ldots \\ x_\text{max} & P\left(X \leq x_\text{max}\right) = 1 \\ \end{pmatrix}\end{split}\]

    For mixed discrete and continuous features, there might be fewer than n_bins bins. For discrete features, n_bins is ignored and the CDF is calculated for each unique value. Type of bins_and_cdfs_: list of tuples

Examples

>>> import numpy as np
>>> feature_1 = np.asarray([2.1, 2.2, 2.5, 3.1, 3.3, 3.7, 4.1, 4.4])
>>> X = np.c_[feature_1]
>>> eps = 1e-8
>>> from cyclic_boosting.binning import ECdfTransformer
>>> trans = ECdfTransformer(n_bins=4, epsilon=eps)
>>> trans = trans.fit(X)
>>> # only one input column
>>> column, epsilon, bins_cdfs = trans.bins_and_cdfs_[0]
>>> assert column == 0 and np.allclose(epsilon, eps * 0.1)
>>> bins_cdfs
array([[ 2.1 ,  0.  ],
       [ 2.2 ,  0.25],
       [ 3.1 ,  0.5 ],
       [ 3.7 ,  0.75],
       [ 4.4 ,  1.  ]])
>>> X_test = np.c_[[1.9, 2.4, 2.2, 3.6, 3.5, 4.3, 5.1]]
>>> trans.transform(X_test)
array([[ 0.        ],
       [ 0.30555556],
       [ 0.25      ],
       [ 0.70833333],
       [ 0.66666667],
       [ 0.96428571],
       [ 1.        ]])
fit(X, y=None)[source]#
transform(X, y=None)[source]#
cyclic_boosting.binning.ecdf_transformer.calculate_cdf_from_weighted_data(z, w)[source]#

Calculate the CDF value for each unique value in z, weighted with the sample weights in w. All non-finite values in z and all unique values of z with zero weight are ignored.

Parameters:
  • z (numpy.ndarray of float64) – input array

  • w (numpy.ndarray) – sample weights

Returns:

Tuple consisting of an array containing the valid unique z values, an array containing the CDF values for the valid z values, the total weight sum and the number of non-finite values in z.

Return type:

tuple of two numpy.ndarray, a double and an int

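Examples

>>> import numpy as np
>>> from cyclic_boosting.binning.ecdf_transformer import calculate_cdf_from_weighted_data
>>> z = np.array([1., 2., 3., 4., 5., 6., np.nan, 6.])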
>>> w = np.array([4., 2., 2., 1., 0., 1., 1.,     0.])
>>> z_unique, cdfs, wsum, n_nan = calculate_cdf_from_weighted_data(z, w)
>>> wsum
10.0
>>> n_nan
1
>>> z_unique  # array of unique values of z
array([ 1.,  2.,  3.,  4.,  6.])
>>> cdfs  # corresponding cdf values to z_unique
array([ 0.4,  0.6,  0.8,  0.9,  1. ])
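For readers who want to see the computation itself, the weighted CDF can be sketched in plain Python. This is an illustrative re-implementation under the rules stated above (non-finite values and zero-weight unique values are ignored), not the library's code; the name weighted_cdf is hypothetical.

```python
import math
from collections import defaultdict


def weighted_cdf(z, w):
    """Return (unique values, cdf values, weight sum, number of non-finite z)."""
    totals = defaultdict(float)
    n_nonfinite = 0
    for value, weight in zip(z, w):
        if not math.isfinite(value):
            n_nonfinite += 1  # counted, but excluded from the CDF
            continue
        totals[value] += weight
    weight_sum = sum(totals.values())
    unique, cdfs, running = [], [], 0.0
    for value in sorted(totals):
        if totals[value] == 0:
            continue  # unique values with zero weight are ignored
        running += totals[value]
        unique.append(value)
        cdfs.append(running / weight_sum)
    return unique, cdfs, weight_sum, n_nonfinite


z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, math.nan, 6.0]
w = [4.0, 2.0, 2.0, 1.0, 0.0, 1.0, 1.0, 0.0]
print(weighted_cdf(z, w))
# ([1.0, 2.0, 3.0, 4.0, 6.0], [0.4, 0.6, 0.8, 0.9, 1.0], 10.0, 1)
```

The result agrees with the documented example: the weight attached to the NaN entry does not enter the weight sum.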
cyclic_boosting.binning.ecdf_transformer.get_X_column(X, column, array_for_1_dim=True)[source]#

Picks columns from pandas.DataFrame or numpy.ndarray.

Parameters:
  • X (pandas.DataFrame or numpy.ndarray) – Data Source from which columns are picked.

  • column – The format depends on the type of X. For a pandas.DataFrame, pass a string or a list/tuple of strings naming the columns. For a numpy.ndarray, pass an integer or a list/tuple of integers indexing the columns.

  • array_for_1_dim (bool) – In default mode (set to True) the return type for a one dimensional access is a np.ndarray with shape (n, ). If set to False it is a np.ndarray with shape (1, n).

cyclic_boosting.binning.ecdf_transformer.get_feature_column_names_or_indices(X: DataFrame | ndarray, exclude_columns: List[str] | List[int] | None = None) List[str] | List[int][source]#

Extract the column names from X. If X is a numpy matrix each column is labeled with an integer starting from zero.

Parameters:
  • X (numpy.ndarray(dim=2) or pandas.DataFrame) – input matrix

  • exclude_columns (list of int or str) – column names or indices to omit.

Return type:

list

>>> import numpy as np
>>> import pandas as pd
>>> X = np.c_[[0, 1], [1,0], [3, 5]]
>>> from cyclic_boosting.binning import get_feature_column_names_or_indices
>>> get_feature_column_names_or_indices(X)
[0, 1, 2]
>>> get_feature_column_names_or_indices(X, exclude_columns=[1])
[0, 2]
>>> get_feature_column_names_or_indices(X, exclude_columns=[1, 1])
[0, 2]
>>> get_feature_column_names_or_indices(X, exclude_columns=[0, 1, 2])
[]
>>> X = pd.DataFrame(X, columns = ['b', 'c', 'a'])
>>> get_feature_column_names_or_indices(X, exclude_columns=['a'])
['b', 'c']
>>> get_feature_column_names_or_indices(X, exclude_columns=['d'])
['b', 'c', 'a']
cyclic_boosting.binning.ecdf_transformer.get_weight_column(X, weight_column=None)[source]#

Check if a weight column is present and return it if possible. If no weight column is present in X, a weight column containing only ones, with the same length as X, is created and returned.

Parameters:
  • X (numpy.ndarray(dim=2) or pandas.DataFrame) – Samples feature matrix.

  • weight_column (int or string or NoneType) – Name or index of the weight column or None.

Return type:

numpy.ndarray

>>> import numpy as np
>>> import pandas as pd
>>> X = np.c_[[0., 1], [1,0], [3, 5]]
>>> from cyclic_boosting.binning import get_weight_column
>>> get_weight_column(X)
array([ 1.,  1.])
>>> get_weight_column(X, 0)
array([ 0.,  1.])
>>> get_weight_column(X, 2)
array([ 3.,  5.])
>>> X = pd.DataFrame(X, columns = ['b', 'c', 'a'])
>>> get_weight_column(X)
array([ 1.,  1.])
>>> get_weight_column(X, 'c')
array([ 1.,  0.])
cyclic_boosting.binning.ecdf_transformer.reduce_cdf_and_boundaries_to_nbins(bins_x, cdf_x, n_bins, epsilon, tolerance)[source]#

Section the CDF spectrum into n_bins parts of equal statistics, and find all events belonging to these bins by filtering the event-wise cdf_x array.

Events often cannot be distributed over all bins with exactly equal statistics. The tolerance argument therefore allows bins to have a weight below 1.0 / n_bins.

A minimum weight of 1.0 / n_bins - tolerance per bin is guaranteed.

This function is used internally by cyclic_boosting.binning.ECdfTransformer.

Parameters:
  • bins_x (np.ndarray) – Strictly increasing array containing all bin boundaries; its length is the number of events.

  • cdf_x (np.ndarray) – Strictly increasing array containing the CDF values corresponding to the bin boundaries in bins_x. Contains one value for each event.

  • n_bins (int) – Maximum number of bins that ought to be returned. This also determines the minimum weight per bin, which is 1 / n_bins.

  • epsilon (double) – Threshold for the comparison of CDFs

  • tolerance (double) – Relative tolerance of the minimum bin weight. (E.g. if you specify 100 bins and a tolerance of 0.05 the bins are required to have only 0.95% of the total bin weights instead of 1.0%)

Returns:

The reduced input arrays bins_x and cdf_x, now with maximum length n_bins, as a tuple of numpy.ndarrays(dim=1).
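The idea of sectioning the CDF into parts of equal statistics can be sketched as below. This is a simplified stand-in, not the actual algorithm: it keeps the values at which the CDF first reaches each multiple of 1 / n_bins and deliberately leaves out the tolerance handling. The name reduce_to_n_bins is hypothetical.

```python
def reduce_to_n_bins(values, cdfs, n_bins, epsilon=1e-9):
    """Keep the values at which the CDF reaches each multiple of 1 / n_bins.

    values and cdfs are strictly increasing and of equal length. The
    tolerance handling of the real function is not part of this sketch.
    """
    boundaries, reduced_cdfs = [], []
    k = 1  # index of the next CDF target k / n_bins
    for value, cdf in zip(values, cdfs):
        if k <= n_bins and cdf >= k / n_bins - epsilon:
            boundaries.append(value)
            reduced_cdfs.append(cdf)
            # One value may satisfy several targets; skip all of them.
            while k <= n_bins and cdf >= k / n_bins - epsilon:
                k += 1
    return boundaries, reduced_cdfs


values = [2.1, 2.2, 2.5, 3.1, 3.3, 3.7, 4.1, 4.4]
cdfs = [(i + 1) / len(values) for i in range(len(values))]
print(reduce_to_n_bins(values, cdfs, n_bins=4))
# ([2.2, 3.1, 3.7, 4.4], [0.25, 0.5, 0.75, 1.0])
```

The fitted bins_and_cdfs_ in the class examples additionally carry the minimum value with CDF 0 as their first row. When a single value carries a large share of the weight, the sketch returns fewer than n_bins entries.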

Module contents#

class cyclic_boosting.binning.BinNumberTransformer(n_bins=100, feature_properties=None, weight_column=None, epsilon=1e-09, tolerance=0.1, inplace=False)[source]#

Bases: ECdfTransformer

This transformer bins the feature variables in X into integer bins, depending on each feature’s feature property. Features with discrete preprocessing (not continuous, but ordered or unordered) are enumerated by their unique values in ascending order; for example, a column with the values 10, 11, 12 is binned as 0, 1, 2.

If no feature_properties are passed, all columns in X are treated as cyclic_boosting.flags.IS_CONTINUOUS. If a feature_properties dictionary is supplied, it must contain feature properties for each feature in X.

Not-a-number values in the input feature matrix are mapped to cyclic_boosting.binning.MISSING_VALUE_AS_BINNO in the transform step. This value can then be treated as a missing value by Cyclic Boosting.

The feature property cyclic_boosting.flags.HAS_MAGIC_INT_MISSING enables missing-value treatment for values of -999 and -9 in integer-typed feature columns (for both continuous and non-continuous features).

Binning is performed for each feature column individually. For example, two columns with the same value range can end up with completely different bin numbers. The n_bins argument, which is typically an integer, can also be set per column by passing a dict that maps column names to the number of bins to use for continuous preprocessing.

During the fit, all features are treated in the same way as in ECdfTransformer. During the transform step, each feature value is transformed to the number of its feature bin. The range of bin numbers is:

[0, trans.bins_and_cdfs_[feature_no][2].shape[0] - 1)

For the estimated parameters see ECdfTransformer.

Parameters:
  • n_bins (int or dict) – Maximum number of bins used to estimate the empirical CDF. n_bins is ignored for features with discrete preprocessing. If a dict is passed, the feature names/indices are the keys and the numbers of bins are the values. Example: {'feature a': 150, 'feature b': 20}

  • feature_properties (dict) – Dictionary listing the names of all features as keys and their preprocessing flags as values. When using a numpy feature matrix X with no column names the keys of the feature properties are the column indices.

  • weight_column (str or int) – Optional column label or column index for the weight column. If not set all samples receive the same weight 1.

  • epsilon (float) –

    Used thresholds for the comparison of float values:

    • epsilon * 1.0 for the comparison of CDF values

    • epsilon * minimal_bin_width for the comparison with bin boundaries of a given feature

    Default value for epsilon: 1e-9

  • tolerance (double) – Relative tolerance of the minimum bin weight. (E.g. if you specify 100 bins and a tolerance of 0.05 the bins are required to have only 0.95% of the total bin weights instead of 1.0%)

Examples

>>> import numpy as np
>>> feature_1 = np.asarray([2.1, 2.2, 2.5, 3.1, 3.3, 3.7, 4.1, 4.4])
>>> X = np.c_[feature_1]
>>> from cyclic_boosting.binning import BinNumberTransformer
>>> trans = BinNumberTransformer(n_bins=4, epsilon=1e-8)
>>> trans = trans.fit(X)
>>> # only one input column
>>> column, epsilon, bins_cdfs = trans.bins_and_cdfs_[0]
>>> assert column == 0 and np.allclose(epsilon, 1e-8 * 0.1)
>>> bins_cdfs
array([[ 2.1 ,  0.  ],
       [ 2.2 ,  0.25],
       [ 3.1 ,  0.5 ],
       [ 3.7 ,  0.75],
       [ 4.4 ,  1.  ]])
>>> X_test = np.c_[[1.9, 2.15, 2.4, 2.2, 3.6, 3.5, 4.3, 5.1]]
>>> trans.transform(X_test)
array([[0],
       [0],
       [1],
       [0],
       [2],
       [2],
       [3],
       [3]], dtype=int8)
get_feature_bin_boundaries()[source]#
set_transform_request(*, X_orig: bool | None | str = '$UNCHANGED$') BinNumberTransformer#

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

X_orig (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_orig parameter in transform.

Returns:

self – The updated object.

Return type:

object

transform(X_orig: DataFrame | ndarray, y: ndarray | None = None) DataFrame | ndarray[source]#
class cyclic_boosting.binning.ECdfTransformer(n_bins=100, feature_properties=None, weight_column=None, epsilon=1e-09, tolerance=0.1)[source]#

Bases: BaseEstimator, TransformerMixin

Transform features to the empirical CDF scale of the training data.

CDF = \(P\left(X \leq x\right)\) = cumulative distribution function. See the Wikipedia article on the cumulative distribution function.

Each feature found in feature_properties is considered separately.

In fit(), up to n_bins bin boundaries with an approximately equal number of data points per bin are determined. For discrete values, the complete CDF is stored and n_bins is ignored.

In transform(), each feature value is assigned to its bin by binary search. For features with cyclic_boosting.flags.IS_CONTINUOUS set, the empirical CDF is then interpolated between the left and right bin boundaries; out-of-range values are mapped to the CDF value of the nearest outer bin boundary. For features with cyclic_boosting.flags.IS_ORDERED or cyclic_boosting.flags.IS_UNORDERED, only values seen in the fit are transformed to their empirical CDF values; for all values that are not within epsilon of a value seen in the fit, numpy.nan is returned. Missing values (numpy.nan) stay missing and are not transformed, regardless of the feature_properties set and the feature values seen in fit(). The feature property cyclic_boosting.flags.HAS_MISSING is assumed for all features.

Parameters:
  • n_bins (int, dict) – Maximum number of bins used to estimate the empirical CDF. n_bins is ignored for features with discrete preprocessing. If a dict is passed, the feature names/indices should be the keys and the n_bins are the values. Example : {'feature a': 150, 'feature b': 20}

  • feature_properties (dict) –

    Dictionary listing the names of all features as keys and their preprocessing flags as values. When using a numpy feature matrix X with no column names the keys of the feature properties are the column indices. If no feature_properties are passed, all columns in X are treated as cyclic_boosting.flags.IS_CONTINUOUS. For more information about feature properties:

  • weight_column – Optional column label or column index for the weight column. If not set all samples receive the same weight 1.

  • epsilon (float) –

    Used thresholds for the comparison of float values:

    • epsilon * 1.0 for the comparison of CDF values

    • epsilon * minimal_bin_width for the comparison with bin boundaries of a given feature

    Default value for epsilon: 1e-9

  • tolerance (double) – Relative tolerance of the minimum bin weight. (E.g. if you specify 100 bins and a tolerance of 0.05 the bins are required to have only 0.95% of the total bin weights instead of 1.0%)

Guarantees for continuous features (cyclic_boosting.flags.IS_CONTINUOUS set for feature)

  • The estimated number of bins \(n_\text{bins\_estimated}\) is always less than or equal to the number of bins requested by the user \(n_\text{bins}\).

    \[n_\text{bins\_estimated} \leq n_\text{bins}\]
  • The bin boundaries are chosen such that each bin contains at least a fraction of \(\frac{1}{n_\text{bins}}\) of all values.

Guarantees for discrete features (cyclic_boosting.flags.IS_UNORDERED or cyclic_boosting.flags.IS_ORDERED set for feature)

  • The estimated number of bins \(n_\text{bins\_estimated}\) is equal to the number of unique values \(n_\text{unique\_values}\) found.

    \[n_\text{bins\_estimated} = n_\text{unique\_values}\]

Estimated parameters

bins_and_cdfs_#

For each feature, a tuple containing

  • the column name or index

  • the epsilon used for comparisons to bin boundaries; it is the constructor parameter epsilon multiplied by the smallest bin width

  • and the numpy.ndarray of shape (at most n_bins + 1, 2)

    This matrix contains the bin boundaries (column 0) and the corresponding cumulative probabilities (column 1) learned in the fit. For one feature x, the matrix looks like this:

    \[\begin{split}\begin{pmatrix} x_\text{min} & P\left(X < x_\text{min}\right) = 0 \\ x_\text{boundary1} & P\left(X \leq x_\text{boundary1}\right) \\ x_\text{boundary2} & P\left(X \leq x_\text{boundary2}\right) \\ \ldots & \ldots \\ x_\text{max} & P\left(X \leq x_\text{max}\right) = 1 \\ \end{pmatrix}\end{split}\]

    For mixed discrete and continuous features, there might be fewer than n_bins bins. For discrete features, n_bins is ignored and the CDF is calculated for each unique value. Type of bins_and_cdfs_: list of tuples

Examples

>>> import numpy as np
>>> feature_1 = np.asarray([2.1, 2.2, 2.5, 3.1, 3.3, 3.7, 4.1, 4.4])
>>> X = np.c_[feature_1]
>>> eps = 1e-8
>>> from cyclic_boosting.binning import ECdfTransformer
>>> trans = ECdfTransformer(n_bins=4, epsilon=eps)
>>> trans = trans.fit(X)
>>> # only one input column
>>> column, epsilon, bins_cdfs = trans.bins_and_cdfs_[0]
>>> assert column == 0 and np.allclose(epsilon, eps * 0.1)
>>> bins_cdfs
array([[ 2.1 ,  0.  ],
       [ 2.2 ,  0.25],
       [ 3.1 ,  0.5 ],
       [ 3.7 ,  0.75],
       [ 4.4 ,  1.  ]])
>>> X_test = np.c_[[1.9, 2.4, 2.2, 3.6, 3.5, 4.3, 5.1]]
>>> trans.transform(X_test)
array([[ 0.        ],
       [ 0.30555556],
       [ 0.25      ],
       [ 0.70833333],
       [ 0.66666667],
       [ 0.96428571],
       [ 1.        ]])
fit(X, y=None)[source]#
transform(X, y=None)[source]#
cyclic_boosting.binning.get_bin_bounds(binners, feat_group)[source]#

Gets the bin boundaries for each feature group.

Parameters:
  • binners (list) – List of binners.

  • feat_group (str or tuple of str) – A feature property for which the bin boundaries should be extracted from the binners.

cyclic_boosting.binning.get_column_index(X, column_name_or_index)[source]#

Integer column index of a pandas.DataFrame or numpy.ndarray.

Parameters:
  • X (numpy.ndarray(dim=2) or pandas.DataFrame) – input matrix

  • column_name_or_index (string or int) – column name or index

Return type:

int

cyclic_boosting.binning.get_feature_column_names_or_indices(X: DataFrame | ndarray, exclude_columns: List[str] | List[int] | None = None) List[str] | List[int][source]#

Extract the column names from X. If X is a numpy matrix each column is labeled with an integer starting from zero.

Parameters:
  • X (numpy.ndarray(dim=2) or pandas.DataFrame) – input matrix

  • exclude_columns (list of int or str) – column names or indices to omit.

Return type:

list

>>> import numpy as np
>>> import pandas as pd
>>> X = np.c_[[0, 1], [1,0], [3, 5]]
>>> from cyclic_boosting.binning import get_feature_column_names_or_indices
>>> get_feature_column_names_or_indices(X)
[0, 1, 2]
>>> get_feature_column_names_or_indices(X, exclude_columns=[1])
[0, 2]
>>> get_feature_column_names_or_indices(X, exclude_columns=[1, 1])
[0, 2]
>>> get_feature_column_names_or_indices(X, exclude_columns=[0, 1, 2])
[]
>>> X = pd.DataFrame(X, columns = ['b', 'c', 'a'])
>>> get_feature_column_names_or_indices(X, exclude_columns=['a'])
['b', 'c']
>>> get_feature_column_names_or_indices(X, exclude_columns=['d'])
['b', 'c', 'a']
cyclic_boosting.binning.get_weight_column(X, weight_column=None)[source]#

Check if a weight column is present and return it if possible. If no weight column is present in X, a weight column containing only ones, with the same length as X, is created and returned.

Parameters:
  • X (numpy.ndarray(dim=2) or pandas.DataFrame) – Samples feature matrix.

  • weight_column (int or string or NoneType) – Name or index of the weight column or None.

Return type:

numpy.ndarray

>>> import numpy as np
>>> import pandas as pd
>>> X = np.c_[[0., 1], [1,0], [3, 5]]
>>> from cyclic_boosting.binning import get_weight_column
>>> get_weight_column(X)
array([ 1.,  1.])
>>> get_weight_column(X, 0)
array([ 0.,  1.])
>>> get_weight_column(X, 2)
array([ 3.,  5.])
>>> X = pd.DataFrame(X, columns = ['b', 'c', 'a'])
>>> get_weight_column(X)
array([ 1.,  1.])
>>> get_weight_column(X, 'c')
array([ 1.,  0.])
cyclic_boosting.binning.minimal_difference(values)[source]#

Minimal difference of consecutive array values excluding zero differences.

Parameters:

values (numpy.ndarray with dim=1.) – Array values
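A plain-Python sketch of this computation, and of how it relates to the boundary epsilon described for bins_and_cdfs_, is given below. It assumes the input is sorted in ascending order; returning None when no non-zero difference exists is an assumption of the sketch, and the name minimal_difference_sketch is hypothetical.

```python
def minimal_difference_sketch(values):
    """Smallest non-zero difference between consecutive values, or None."""
    # Assumes values are sorted ascending, so all differences are >= 0.
    differences = [b - a for a, b in zip(values, values[1:]) if b != a]
    return min(differences, default=None)


# Bin boundaries from the class examples; the smallest bin width is 0.1.
boundaries = [2.1, 2.2, 3.1, 3.7, 4.4]
smallest_width = minimal_difference_sketch(boundaries)

# Boundary epsilon = constructor epsilon * smallest bin width.
boundary_epsilon = 1e-8 * smallest_width
print(round(smallest_width, 12))  # 0.1
```

With a constructor epsilon of 1e-8, this gives a boundary epsilon of about 1e-9, in line with the assertion in the class examples.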

cyclic_boosting.binning.reduce_cdf_and_boundaries_to_nbins(bins_x, cdf_x, n_bins, epsilon, tolerance)[source]#

Section the CDF spectrum into n_bins parts of equal statistics, and find all events belonging to these bins by filtering the event-wise cdf_x array.

Events often cannot be distributed over all bins with exactly equal statistics. The tolerance argument therefore allows bins to have a weight below 1.0 / n_bins.

A minimum weight of 1.0 / n_bins - tolerance per bin is guaranteed.

This function is used internally by cyclic_boosting.binning.ECdfTransformer.

Parameters:
  • bins_x (np.ndarray) – Strictly increasing array containing all bin boundaries; its length is the number of events.

  • cdf_x (np.ndarray) – Strictly increasing array containing the CDF values corresponding to the bin boundaries in bins_x. Contains one value for each event.

  • n_bins (int) – Maximum number of bins that ought to be returned. This also determines the minimum weight per bin, which is 1 / n_bins.

  • epsilon (double) – Threshold for the comparison of CDFs

  • tolerance (double) – Relative tolerance of the minimum bin weight. (E.g. if you specify 100 bins and a tolerance of 0.05 the bins are required to have only 0.95% of the total bin weights instead of 1.0%)

Returns:

The reduced input arrays bins_x and cdf_x, now with maximum length n_bins, as a tuple of numpy.ndarrays(dim=1).