cyclic_boosting.binning package#
Submodules#
cyclic_boosting.binning.bin_number_transformer module#
- class cyclic_boosting.binning.bin_number_transformer.BinNumberTransformer(n_bins=100, feature_properties=None, weight_column=None, epsilon=1e-09, tolerance=0.1, inplace=False)[source]#
Bases: ECdfTransformer
This transformer bins the feature variables in X into integer bins, depending on each feature's feature property. Features with discrete preprocessing (not continuous, but ordered or unordered) are enumerated by their unique values, ascending from the lowest. Thus, a column with the values 10, 11, 12 would be binned as 0, 1, 2.
If no feature_properties are passed, all columns in X are treated as cyclic_boosting.flags.IS_CONTINUOUS. If a feature_properties dictionary is supplied, it must contain feature properties for each feature in X.
Not-a-number values in the input feature matrix are mapped to cyclic_boosting.binning.MISSING_VALUE_AS_BINNO in the transform step. This value can then be treated as a missing value by Cyclic Boosting.
The feature property cyclic_boosting.flags.HAS_MAGIC_INT_MISSING enables missing-value treatment for values of -999 and -9 in integer-typed feature columns (for both continuous and non-continuous features).
Binning is performed for each feature column individually. For example, two columns with the same value range can end up with totally different bin numbers. The n_bins argument, which is typically an integer, can also be individualized by passing a dict that maps column names to the number of bins to use for continuous preprocessing.
During the fit, all features are treated in the same way as in ECdfTransformer. During the transform step, each feature value is transformed to the number of its feature bin. The range of bin numbers is: [0, trans.bins_and_cdfs_[feature_no][1].shape[0] - 1)
For the estimated parameters see ECdfTransformer.
- Parameters:
n_bins (int or dict) – Maximum number of bins used to estimate the empirical CDF. n_bins is ignored for features with discrete preprocessing. If a dict is passed, the feature names/indices are the keys and the respective numbers of bins are the values. Example: {'feature a': 150, 'feature b': 20}
feature_properties (dict) – Dictionary listing the names of all features as keys and their preprocessing flags as values. When using a numpy feature matrix X with no column names the keys of the feature properties are the column indices.
weight_column (str or int) – Optional column label or column index for the weight column. If not set all samples receive the same weight 1.
epsilon (float) – Thresholds used for the comparison of float values:
epsilon * 1.0 for the comparison of CDF values
epsilon * minimal_bin_width for the comparison with bin boundaries of a given feature
Default value for epsilon: 1e-9
tolerance (double) – Relative tolerance of the minimum bin weight. (E.g. if you specify 100 bins and a tolerance of 0.05 the bins are required to have only 0.95% of the total bin weights instead of 1.0%)
Examples
>>> import numpy as np
>>> feature_1 = np.asarray([2.1, 2.2, 2.5, 3.1, 3.3, 3.7, 4.1, 4.4])
>>> X = np.c_[feature_1]
>>> from cyclic_boosting.binning import BinNumberTransformer
>>> trans = BinNumberTransformer(n_bins=4, epsilon=1e-8)
>>> trans = trans.fit(X)
>>> # only one input column
>>> column, epsilon, bins_cdfs = trans.bins_and_cdfs_[0]
>>> assert column == 0 and np.allclose(epsilon, 1e-8 * 0.1)
>>> bins_cdfs
array([[ 2.1 ,  0.  ],
       [ 2.2 ,  0.25],
       [ 3.1 ,  0.5 ],
       [ 3.7 ,  0.75],
       [ 4.4 ,  1.  ]])
>>> X_test = np.c_[[1.9, 2.15, 2.4, 2.2, 3.6, 3.5, 4.3, 5.1]]
>>> trans.transform(X_test)
array([[0], [0], [1], [0], [2], [2], [3], [3]], dtype=int8)
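The bin assignment shown in this example can be reproduced with a short pure-Python sketch. It is not the library's implementation: the helper names bin_number and enumerate_discrete are hypothetical, and the rule that a value equal to a boundary falls into the bin on its left is an assumption inferred from the documented output (2.2 maps to bin 0). Missing-value handling is left out.

```python
from bisect import bisect_left


def bin_number(x, boundaries):
    """Map x to a bin index, given ascending bin boundaries (hypothetical helper)."""
    n_bins = len(boundaries) - 1
    # bisect_left counts the boundaries strictly below x, so a value that
    # equals a boundary is assigned to the bin on its left.
    index = bisect_left(boundaries, x) - 1
    # Out-of-range values are clipped into the first or last bin.
    return min(max(index, 0), n_bins - 1)


def enumerate_discrete(values):
    """Enumerate discrete values by their sorted unique values (hypothetical helper)."""
    mapping = {value: i for i, value in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values]


# Boundaries and test values taken from the documented example.
boundaries = [2.1, 2.2, 3.1, 3.7, 4.4]
x_test = [1.9, 2.15, 2.4, 2.2, 3.6, 3.5, 4.3, 5.1]
bins = [bin_number(x, boundaries) for x in x_test]
print(bins)  # [0, 0, 1, 0, 2, 2, 3, 3]
print(enumerate_discrete([10, 11, 12, 11]))  # [0, 1, 2, 1]
```

The clipping step mirrors the documented behavior that 1.9 and 5.1, which lie outside the fitted range, end up in the first and last bin.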
- set_transform_request(*, X_orig: bool | None | str = '$UNCHANGED$') -> BinNumberTransformer#
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
X_orig (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the X_orig parameter in transform.
- Returns:
self – The updated object.
- Return type:
object
cyclic_boosting.binning.ecdf_transformer module#
- class cyclic_boosting.binning.ecdf_transformer.ECdfTransformer(n_bins=100, feature_properties=None, weight_column=None, epsilon=1e-09, tolerance=0.1)[source]#
Bases: BaseEstimator, TransformerMixin
Transform features to the empirical CDF scale of the training data.
CDF = \(P\left(X \leq x\right)\) = cumulative distribution function. See CDF on Wikipedia.
Each feature found in feature_properties is considered separately.
In fit(), (up to) n_bins bin boundaries with an approximately equal number of data points are determined. For discrete values, the complete CDF is stored and n_bins is ignored.
In transform(), each feature value is associated with the corresponding bin by binary search. For features with cyclic_boosting.flags.IS_CONTINUOUS set, the empirical CDF is then interpolated between the left and the right bin boundary. For out-of-range feature values, the CDF values at the outermost bin boundaries are taken. For features with cyclic_boosting.flags.IS_ORDERED or cyclic_boosting.flags.IS_UNORDERED, only values that have been seen in the fit are transformed to the corresponding empirical CDF values. For all values not within epsilon of the values seen in the fit, numpy.nan is returned. Missing values (numpy.nan) stay missing values and are not transformed, regardless of the feature_properties set and the feature values seen in fit(). For all features, the feature property cyclic_boosting.flags.HAS_MISSING is assumed.
- Parameters:
n_bins (int or dict) – Maximum number of bins used to estimate the empirical CDF. n_bins is ignored for features with discrete preprocessing. If a dict is passed, the feature names/indices are the keys and the respective numbers of bins are the values. Example: {'feature a': 150, 'feature b': 20}
feature_properties (dict) – Dictionary listing the names of all features as keys and their preprocessing flags as values. When using a numpy feature matrix X with no column names, the keys of the feature properties are the column indices. If no feature_properties are passed, all columns in X are treated as cyclic_boosting.flags.IS_CONTINUOUS. For more information about feature properties, see the cyclic_boosting.flags module.
weight_column – Optional column label or column index for the weight column. If not set all samples receive the same weight 1.
epsilon (float) – Thresholds used for the comparison of float values:
epsilon * 1.0 for the comparison of CDF values
epsilon * minimal_bin_width for the comparison with bin boundaries of a given feature
Default value for epsilon: 1e-9
tolerance (double) – Relative tolerance of the minimum bin weight. (E.g. if you specify 100 bins and a tolerance of 0.05 the bins are required to have only 0.95% of the total bin weights instead of 1.0%)
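As a worked instance of the tolerance arithmetic, the following sketch reads tolerance as a relative reduction of the nominal bin weight 1 / n_bins, as the parameter description states. The function name min_bin_weight is hypothetical and not part of cyclic_boosting.

```python
def min_bin_weight(n_bins, tolerance):
    """Smallest allowed bin weight as a fraction of the total weight.

    Assumes tolerance is relative to the nominal weight 1 / n_bins.
    """
    return (1.0 - tolerance) / n_bins


# 100 bins with a tolerance of 0.05: 0.95% of the total weight instead of 1.0%.
print(f"{min_bin_weight(100, 0.05):.4%}")  # 0.9500%
```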
Guarantees for continuous features (cyclic_boosting.flags.IS_CONTINUOUS set for feature)
The estimated number of bins \(n_\text{bins\_estimated}\) is always less than or equal to the number of bins requested by the user \(n_\text{bins}\).
\[n_\text{bins\_estimated} \leq n_\text{bins}\]
The bin boundaries are chosen such that each bin contains at least a fraction of \(\frac{1}{n_\text{bins}}\) of all values.
Guarantees for discrete features (cyclic_boosting.flags.IS_UNORDERED or cyclic_boosting.flags.IS_ORDERED set for feature)
The estimated number of bins \(n_\text{bins\_estimated}\) is equal to the number of unique values \(n_\text{unique\_values}\) found.
\[n_\text{bins\_estimated} = n_\text{unique\_values}\]
Estimated parameters
- bins_and_cdfs_#
For each feature, a tuple containing
the column name or index,
the epsilon used for comparisons to bin boundaries; it is the constructor parameter epsilon multiplied by the smallest bin width, and
the numpy.ndarray of shape (at most n_bins + 1, 2).
This matrix, which contains the bin boundaries (column 0) and the corresponding cumulative probabilities (column 1), is learned in the fit. For one feature x, the matrix looks like this:
\[\begin{split}\begin{pmatrix} x_\text{min} & P\left(X < x_\text{min}\right) = 0 \\ x_\text{boundary1} & P\left(X \leq x_\text{boundary1}\right) \\ x_\text{boundary2} & P\left(X \leq x_\text{boundary2}\right) \\ \ldots & \ldots \\ x_\text{max} & P\left(X \leq x_\text{max}\right) = 1 \\ \end{pmatrix}\end{split}\]
For mixed discrete and continuous features, there might be fewer than n_bins bins. For discrete features, n_bins is ignored and the CDF is calculated for each unique value.
Type of bins_and_cdfs_: list of tuple
Examples
>>> import numpy as np
>>> feature_1 = np.asarray([2.1, 2.2, 2.5, 3.1, 3.3, 3.7, 4.1, 4.4])
>>> X = np.c_[feature_1]
>>> eps = 1e-8
>>> from cyclic_boosting.binning import ECdfTransformer
>>> trans = ECdfTransformer(n_bins=4, epsilon=eps)
>>> trans = trans.fit(X)
>>> # only one input column
>>> column, epsilon, bins_cdfs = trans.bins_and_cdfs_[0]
>>> assert column == 0 and np.allclose(epsilon, eps * 0.1)
>>> bins_cdfs
array([[ 2.1 ,  0.  ],
       [ 2.2 ,  0.25],
       [ 3.1 ,  0.5 ],
       [ 3.7 ,  0.75],
       [ 4.4 ,  1.  ]])
>>> X_test = np.c_[[1.9, 2.4, 2.2, 3.6, 3.5, 4.3, 5.1]]
>>> trans.transform(X_test)
array([[ 0.        ],
       [ 0.30555556],
       [ 0.25      ],
       [ 0.70833333],
       [ 0.66666667],
       [ 0.96428571],
       [ 1.        ]])
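The transformed values in this example follow from linear interpolation between neighboring bin boundaries. The sketch below is a pure-Python illustration under that assumption; interpolate_cdf is a hypothetical helper, not a function of cyclic_boosting, and it ignores epsilon comparisons and missing values.

```python
from bisect import bisect_right


def interpolate_cdf(x, boundaries, cdfs):
    """Linearly interpolate the CDF at x (hypothetical helper).

    boundaries must be strictly increasing; cdfs holds the CDF value
    at each boundary.
    """
    # Out-of-range values take the CDF value of the nearest outer boundary.
    if x <= boundaries[0]:
        return cdfs[0]
    if x >= boundaries[-1]:
        return cdfs[-1]
    # Binary search for the boundary at or to the left of x.
    i = bisect_right(boundaries, x) - 1
    x0, x1 = boundaries[i], boundaries[i + 1]
    c0, c1 = cdfs[i], cdfs[i + 1]
    return c0 + (c1 - c0) * (x - x0) / (x1 - x0)


# Fitted matrix and test values taken from the documented example.
boundaries = [2.1, 2.2, 3.1, 3.7, 4.4]
cdfs = [0.0, 0.25, 0.5, 0.75, 1.0]
x_test = [1.9, 2.4, 2.2, 3.6, 3.5, 4.3, 5.1]
result = [round(interpolate_cdf(x, boundaries, cdfs), 8) for x in x_test]
print(result)
# [0.0, 0.30555556, 0.25, 0.70833333, 0.66666667, 0.96428571, 1.0]
```

For instance, 2.4 lies between the boundaries 2.2 (CDF 0.25) and 3.1 (CDF 0.5), so its value is 0.25 + 0.25 * (0.2 / 0.9), which is approximately 0.3056.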
- cyclic_boosting.binning.ecdf_transformer.calculate_cdf_from_weighted_data(z, w)[source]#
Calculate the CDF value for each unique value in z, weighted with the sample weights in w. All non-finite values in z and all unique values of z with weight zero are ignored.
- Parameters:
z (numpy.ndarray of float64) – input array
w (numpy.ndarray) – sample weights
- Returns:
Tuple consisting of an array containing the valid unique z values, an array containing the CDF values for the valid z values, the total weight sum, and the number of non-finite values in z.
- Return type:
tuple of two numpy.ndarray, a double and an int
Examples
>>> import numpy as np
>>> from cyclic_boosting.binning.ecdf_transformer import calculate_cdf_from_weighted_data
>>> z = np.array([1., 2., 3., 4., 5., 6., np.nan, 6.])
>>> w = np.array([4., 2., 2., 1., 0., 1., 1., 0.])
>>> z_unique, cdfs, wsum, n_nan = calculate_cdf_from_weighted_data(z, w)
>>> wsum
10.0
>>> n_nan
1
>>> z_unique  # array of unique values of z
array([ 1.,  2.,  3.,  4.,  6.])
>>> cdfs  # corresponding cdf values to z_unique
array([ 0.4,  0.6,  0.8,  0.9,  1. ])
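The numbers in this example can be traced with a pure-Python sketch of a weighted empirical CDF. It is an illustration only, not the library's code: weighted_cdf is a hypothetical name, and the choice to exclude the weights of non-finite values from the total is an assumption inferred from the documented wsum of 10.0.

```python
import math
from collections import defaultdict


def weighted_cdf(z, w):
    """Weighted CDF per unique finite value of z (hypothetical helper)."""
    weight_per_value = defaultdict(float)
    n_nonfinite = 0
    for value, weight in zip(z, w):
        if not math.isfinite(value):
            # Non-finite entries are counted but carry no weight.
            n_nonfinite += 1
            continue
        weight_per_value[value] += weight

    total = sum(weight_per_value.values())
    unique, cdfs, running = [], [], 0.0
    for value in sorted(weight_per_value):
        if weight_per_value[value] == 0.0:
            # Unique values with zero total weight are dropped.
            continue
        running += weight_per_value[value]
        unique.append(value)
        cdfs.append(running / total)
    return unique, cdfs, total, n_nonfinite


z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, float("nan"), 6.0]
w = [4.0, 2.0, 2.0, 1.0, 0.0, 1.0, 1.0, 0.0]
z_unique, cdfs, wsum, n_nan = weighted_cdf(z, w)
print(z_unique)  # [1.0, 2.0, 3.0, 4.0, 6.0]
print(wsum, n_nan)  # 10.0 1
```

The value 5 disappears because its weight is zero, while 6 is kept because its two occurrences have a combined weight of 1.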
- cyclic_boosting.binning.ecdf_transformer.get_X_column(X, column, array_for_1_dim=True)[source]#
Picks columns from a pandas.DataFrame or numpy.ndarray.
- Parameters:
X (pandas.DataFrame or numpy.ndarray) – Data source from which columns are picked.
column – The format depends on the type of X. For a pandas.DataFrame, you can give a string or a list/tuple of strings naming the columns. For a numpy.ndarray, give an integer or a list/tuple of integers indexing the columns.
array_for_1_dim (bool) – In default mode (set to True), the return type for a one-dimensional access is a np.ndarray with shape (n, ). If set to False, it is a np.ndarray with shape (1, n).
- cyclic_boosting.binning.ecdf_transformer.get_feature_column_names_or_indices(X: DataFrame | ndarray, exclude_columns: List[str] | List[int] | None = None) -> List[str] | List[int][source]#
Extract the column names from X. If X is a numpy matrix each column is labeled with an integer starting from zero.
- Parameters:
X (numpy.ndarray(dim=2) or pandas.DataFrame) – input matrix
exclude_columns (list of int or str) – column names or indices to omit.
- Return type:
list
>>> import numpy as np
>>> import pandas as pd
>>> X = np.c_[[0, 1], [1, 0], [3, 5]]
>>> from cyclic_boosting.binning import get_feature_column_names_or_indices
>>> get_feature_column_names_or_indices(X)
[0, 1, 2]
>>> get_feature_column_names_or_indices(X, exclude_columns=[1])
[0, 2]
>>> get_feature_column_names_or_indices(X, exclude_columns=[1, 1])
[0, 2]
>>> get_feature_column_names_or_indices(X, exclude_columns=[0, 1, 2])
[]
>>> X = pd.DataFrame(X, columns=['b', 'c', 'a'])
>>> get_feature_column_names_or_indices(X, exclude_columns=['a'])
['b', 'c']
>>> get_feature_column_names_or_indices(X, exclude_columns=['d'])
['b', 'c', 'a']
- cyclic_boosting.binning.ecdf_transformer.get_weight_column(X, weight_column=None)[source]#
Check if a weight column is present and return it if possible. If no weight column is present in X, a weight column containing only ones, with the same length as X, is created and returned.
- Parameters:
X (numpy.ndarray(dim=2) or pandas.DataFrame) – Samples feature matrix.
weight_column (int or string or NoneType) – Name or index of the weight column or None.
- Return type:
numpy.ndarray
>>> import numpy as np
>>> import pandas as pd
>>> X = np.c_[[0., 1], [1, 0], [3, 5]]
>>> from cyclic_boosting.binning import get_weight_column
>>> get_weight_column(X)
array([ 1.,  1.])
>>> get_weight_column(X, 0)
array([ 0.,  1.])
>>> get_weight_column(X, 2)
array([ 3.,  5.])
>>> X = pd.DataFrame(X, columns=['b', 'c', 'a'])
>>> get_weight_column(X)
array([ 1.,  1.])
>>> get_weight_column(X, 'c')
array([ 1.,  0.])
- cyclic_boosting.binning.ecdf_transformer.reduce_cdf_and_boundaries_to_nbins(bins_x, cdf_x, n_bins, epsilon, tolerance)[source]#
Section the CDF spectrum into n_bins parts of equal statistics, and find all events belonging to these bins by filtering all suitable events in the event-wise cdf_x array.
Often, events cannot be distributed with exactly equal statistics over all bins; therefore, the tolerance argument allows bins to have a weight below 1.0 / n_bins. A minimum weight of 1.0 / n_bins - tolerance per bin is guaranteed.
This function is used internally by cyclic_boosting.binning.ECdfTransformer.
- Parameters:
bins_x (np.ndarray) – Strictly increasing array containing all bin boundaries; its length is the number of events.
cdf_x (np.ndarray) – Strictly increasing array containing the CDF values corresponding to the bin boundaries in bins_x. Contains one value for each event.
n_bins (int) – Maximum number of bins that ought to be returned. This also determines the minimum weight per bin, which is 1 / n_bins.
epsilon (double) – Threshold for the comparison of CDFs
tolerance (double) – Relative tolerance of the minimum bin weight. (E.g. if you specify 100 bins and a tolerance of 0.05 the bins are required to have only 0.95% of the total bin weights instead of 1.0%)
- Returns:
The reduced input arrays bins_x and cdf_x, now with maximum length n_bins, as a tuple of numpy.ndarrays (dim=1).
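The idea of sectioning the event-wise CDF into parts of equal statistics can be illustrated with a simplified pure-Python sketch. It is not the library's algorithm: reduce_to_n_bins is a hypothetical name, the tolerance argument is ignored, and the first value is always kept as the lower edge with CDF 0, which is an assumption taken from the documented bins_and_cdfs_ matrix.

```python
def reduce_to_n_bins(values, cdfs, n_bins, epsilon=1e-9):
    """Keep only the values whose CDF reaches the next multiple of 1 / n_bins.

    values and cdfs must be strictly increasing and of equal length
    (hypothetical, simplified helper without tolerance handling).
    """
    boundaries = [values[0]]  # lower edge of the first bin
    reduced_cdfs = [0.0]
    target = 1  # next CDF level to reach is target / n_bins
    for value, cdf in zip(values[1:], cdfs[1:]):
        if cdf >= target / n_bins - epsilon:
            boundaries.append(value)
            reduced_cdfs.append(cdf)
            # Skip every level that this boundary already covers.
            while target / n_bins <= cdf + epsilon:
                target += 1
    return boundaries, reduced_cdfs


# Unweighted data from the ECdfTransformer example: each event adds 1/8.
values = [2.1, 2.2, 2.5, 3.1, 3.3, 3.7, 4.1, 4.4]
cdfs = [(i + 1) / len(values) for i in range(len(values))]
bins_x, cdf_x = reduce_to_n_bins(values, cdfs, n_bins=4)
print(bins_x)  # [2.1, 2.2, 3.1, 3.7, 4.4]
print(cdf_x)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

With these inputs the sketch reproduces the boundaries and CDF values shown in the ECdfTransformer example for n_bins=4.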
Module contents#
- class cyclic_boosting.binning.BinNumberTransformer(n_bins=100, feature_properties=None, weight_column=None, epsilon=1e-09, tolerance=0.1, inplace=False)[source]#
Bases: ECdfTransformer
The same class as cyclic_boosting.binning.bin_number_transformer.BinNumberTransformer, exposed at package level. See that entry above for the full description, parameters, examples, and set_transform_request.
- class cyclic_boosting.binning.ECdfTransformer(n_bins=100, feature_properties=None, weight_column=None, epsilon=1e-09, tolerance=0.1)[source]#
Bases: BaseEstimator, TransformerMixin
The same class as cyclic_boosting.binning.ecdf_transformer.ECdfTransformer, exposed at package level. See that entry above for the full description, parameters, guarantees, estimated parameters, and examples.
- cyclic_boosting.binning.get_bin_bounds(binners, feat_group)[source]#
Gets the bin boundaries for each feature group.
- Parameters:
binners (list) – List of binners.
feat_group (str or tuple of str) – A feature property for which the bin boundaries should be extracted from the binners.
- cyclic_boosting.binning.get_column_index(X, column_name_or_index)[source]#
Returns the integer column index for a pandas.DataFrame or numpy.ndarray.
- Parameters:
X (numpy.ndarray(dim=2) or pandas.DataFrame) – input matrix
column_name_or_index (string or int) – column name or index
- Return type:
int
- cyclic_boosting.binning.get_feature_column_names_or_indices(X: DataFrame | ndarray, exclude_columns: List[str] | List[int] | None = None) List[str] | List[int] [source]#
The same function as cyclic_boosting.binning.ecdf_transformer.get_feature_column_names_or_indices, exposed at package level. See that entry above for the parameters, return type, and examples.
- cyclic_boosting.binning.get_weight_column(X, weight_column=None)[source]#
The same function as cyclic_boosting.binning.ecdf_transformer.get_weight_column, exposed at package level. See that entry above for the parameters, return type, and examples.
- cyclic_boosting.binning.minimal_difference(values)[source]#
Minimal difference of consecutive array values excluding zero differences.
- Parameters:
values (numpy.ndarray with dim=1) – Array values
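A minimal pure-Python sketch of this computation is given below. The name minimal_nonzero_difference is hypothetical, and the use of absolute differences and the NaN result for inputs without any nonzero difference are assumptions. The usage lines connect the result to the bin-boundary epsilon from the ECdfTransformer example, where the smallest bin width is 0.1.

```python
def minimal_nonzero_difference(values):
    """Smallest nonzero absolute difference of consecutive values (hypothetical helper)."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:]) if b != a]
    # Assumption: NaN signals that no nonzero difference exists.
    return min(diffs) if diffs else float("nan")


# Bin boundaries from the ECdfTransformer example.
boundaries = [2.1, 2.2, 3.1, 3.7, 4.4]
width = minimal_nonzero_difference(boundaries)
boundary_epsilon = 1e-8 * width  # epsilon * minimal_bin_width
print(round(width, 12))  # 0.1
```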
- cyclic_boosting.binning.reduce_cdf_and_boundaries_to_nbins(bins_x, cdf_x, n_bins, epsilon, tolerance)[source]#
The same function as cyclic_boosting.binning.ecdf_transformer.reduce_cdf_and_boundaries_to_nbins, exposed at package level. See that entry above for the description, parameters, and return value.