model_selection
splits
get_splitter(stratify_cols=None, group_cols=None, n_splits=5, random_state=1414)
Get a cross-validation splitter based on input parameters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`stratify_cols` | `Collection[str]` | Column names for stratification. Defaults to None. | `None` |
`group_cols` | `Collection[str]` | Column names for grouping. Defaults to None. | `None` |
`n_splits` | `int` | Number of splits in the cross-validation. Defaults to 5. | `5` |
`random_state` | `int` | Seed for the random number generator. Defaults to 1414. | `1414` |
Returns:
Name | Type | Description |
---|---|---|
`BaseCrossValidator` | `BaseCrossValidator` | A cross-validation splitter based on the input parameters. |
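
The rule that picks the splitter is not spelled out in this entry. As a rough sketch, and purely as an assumption, the two optional column lists could map onto scikit-learn's four k-fold splitters as shown here. `choose_splitter_name` is a hypothetical helper for illustration and is not part of `aimet_ml`.

```python
from typing import Collection, Optional


def choose_splitter_name(
    stratify_cols: Optional[Collection[str]] = None,
    group_cols: Optional[Collection[str]] = None,
) -> str:
    """Name the scikit-learn splitter that fits the given inputs.

    The mapping is an assumption; the documentation above does not confirm it.
    """
    if stratify_cols and group_cols:
        return "StratifiedGroupKFold"
    if stratify_cols:
        return "StratifiedKFold"
    if group_cols:
        return "GroupKFold"
    return "KFold"


# Hypothetical column names, used only to exercise each branch.
both = choose_splitter_name(stratify_cols=["label"], group_cols=["patient_id"])
plain = choose_splitter_name()
```

Under this reading, passing neither argument falls back to an ordinary k-fold split.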
Source code in aimet_ml/model_selection/splits.py
join_cols(df, cols, sep='_')
Concatenate the specified columns of a DataFrame with a separator.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`df` | `DataFrame` | The DataFrame to operate on. | required |
`cols` | `Collection[str]` | Column names to concatenate. | required |
`sep` | `str` | The separator to use between the column values. Defaults to "_". | `'_'` |
Returns:
Type | Description |
---|---|
`Series` | A pandas Series containing the concatenated values. |
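
To make the result concrete, here is a minimal sketch of the same concatenation on plain Python rows rather than a DataFrame. `join_row_values` is a hypothetical stand-in, and casting every value to `str` before joining is an assumption about how non-string columns are handled.

```python
from typing import Collection, Dict, List


def join_row_values(
    rows: List[Dict[str, object]], cols: Collection[str], sep: str = "_"
) -> List[str]:
    """Join the selected values of each row, in the order given by cols."""
    # Values are cast to str so numeric columns can be joined too (assumption).
    return [sep.join(str(row[col]) for col in cols) for row in rows]


# Hypothetical data for illustration.
rows = [
    {"label": "cat", "site": "A", "age": 3},
    {"label": "dog", "site": "B", "age": 5},
]
joined = join_row_values(rows, ["label", "site"])
# joined == ["cat_A", "dog_B"]
```

A combined key of this kind gives one value per row, which is convenient when several columns together define a stratum or a group.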
Source code in aimet_ml/model_selection/splits.py
split_dataset(dataset_df, val_fraction=0.1, test_n_splits=5, stratify_cols=None, group_cols=None, train_split_name_format='train_fold_{}', val_split_name_format='val_fold_{}', test_split_name_format='test_fold_{}', random_seed=1414)
Split a dataset into k-fold cross-validation sets with stratification and grouping.
The dataset will be split into k-fold cross-validation sets, each containing development and test sets. For each fold, the development set will be further split into training and validation sets. The final data splits include k test sets, k training sets, and k validation sets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`dataset_df` | `DataFrame` | The input DataFrame to be split. | required |
`val_fraction` | `Union[float, int]` | The fraction of data to be used for validation. If a float is given, it is rounded to the nearest fraction. If an integer (n) is given, the fraction is calculated as 1/n. Defaults to 0.1. | `0.1` |
`test_n_splits` | `int` | Number of cross-validation splits. Defaults to 5. | `5` |
`stratify_cols` | `Collection[str]` | Column names for stratification. Defaults to None. | `None` |
`group_cols` | `Collection[str]` | Column names for grouping. Defaults to None. | `None` |
`train_split_name_format` | `str` | Format for naming training splits. Defaults to "train_fold_{}". | `'train_fold_{}'` |
`val_split_name_format` | `str` | Format for naming validation splits. Defaults to "val_fold_{}". | `'val_fold_{}'` |
`test_split_name_format` | `str` | Format for naming test splits. Defaults to "test_fold_{}". | `'test_fold_{}'` |
`random_seed` | `int` | Random seed for reproducibility. Defaults to 1414. | `1414` |
Returns:
Type | Description |
---|---|
`Dict[str, DataFrame]` | A dictionary mapping each split name to its DataFrame. |
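
The nested scheme described above (k test folds, each with its own training and validation split carved out of the development set) can be sketched on a plain list of item ids. This is only an illustration of the structure: `nested_kfold_sketch` is hypothetical, it does no shuffling, stratification, or grouping, and numbering folds from 0 and taking validation as a share of the development set are both assumptions.

```python
from typing import Dict, List


def nested_kfold_sketch(
    items: List[int],
    test_n_splits: int = 5,
    val_every: int = 10,  # stands in for val_fraction = 1/10
    train_fmt: str = "train_fold_{}",
    val_fmt: str = "val_fold_{}",
    test_fmt: str = "test_fold_{}",
) -> Dict[str, List[int]]:
    """Build k test folds, then split each fold's development set."""
    splits: Dict[str, List[int]] = {}
    for fold in range(test_n_splits):
        # Every k-th item, offset by the fold index, is this fold's test set.
        test = items[fold::test_n_splits]
        test_set = set(test)
        dev = [x for x in items if x not in test_set]
        # Validation is drawn from the development set only.
        val = dev[::val_every]
        val_set = set(val)
        train = [x for x in dev if x not in val_set]
        splits[train_fmt.format(fold)] = train
        splits[val_fmt.format(fold)] = val
        splits[test_fmt.format(fold)] = test
    return splits


splits = nested_kfold_sketch(list(range(50)))
# 5 folds x 3 splits = 15 entries; each fold's splits cover all 50 items once.
```

In this sketch every item lands in exactly one test fold, and within a fold the training, validation, and test sets never overlap.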
Source code in aimet_ml/model_selection/splits.py
split_dataset_single_test(dataset_df, test_fraction=0.2, val_n_splits=5, stratify_cols=None, group_cols=None, test_split_name='test', dev_split_name='dev', train_split_name_format='train_fold_{}', val_split_name_format='val_fold_{}', random_seed=1414)
Split a dataset into development, test, and cross-validation sets with stratification and grouping.
The dataset will be split into a development set and a test set. The development set will then be further split into k-fold cross-validation sets, each containing its own training and validation sets. The final data splits include a test set, k training sets, and k validation sets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`dataset_df` | `DataFrame` | The input DataFrame to be split. | required |
`test_fraction` | `Union[float, int]` | The fraction of data to be used for testing. If a float is given, it is rounded to the nearest fraction. If an integer (n) is given, the fraction is calculated as 1/n. Defaults to 0.2. | `0.2` |
`val_n_splits` | `int` | Number of cross-validation splits. Defaults to 5. | `5` |
`stratify_cols` | `Collection[str]` | Column names for stratification. Defaults to None. | `None` |
`group_cols` | `Collection[str]` | Column names for grouping. Defaults to None. | `None` |
`test_split_name` | `str` | Name for the test split. Defaults to "test". | `'test'` |
`dev_split_name` | `str` | Name for the development split. Defaults to "dev". | `'dev'` |
`train_split_name_format` | `str` | Format for naming training splits. Defaults to "train_fold_{}". | `'train_fold_{}'` |
`val_split_name_format` | `str` | Format for naming validation splits. Defaults to "val_fold_{}". | `'val_fold_{}'` |
`random_seed` | `int` | Random seed for reproducibility. Defaults to 1414. | `1414` |
Returns:
Type | Description |
---|---|
`Dict[str, DataFrame]` | A dictionary mapping each split name to its DataFrame. |
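
For contrast with the fully nested variant, the single-test scheme can be sketched on a plain list of item ids as follows. `single_test_sketch` is a hypothetical illustration without shuffling, stratification, or grouping. Returning the development set under the `dev` key and numbering folds from 0 are assumptions suggested by the parameter names, not confirmed behaviour.

```python
from typing import Dict, List


def single_test_sketch(
    items: List[int],
    test_every: int = 5,  # stands in for test_fraction = 1/5
    val_n_splits: int = 5,
) -> Dict[str, List[int]]:
    """Hold out one test set, then run k-fold on the development set."""
    test = items[::test_every]
    test_set = set(test)
    dev = [x for x in items if x not in test_set]
    splits: Dict[str, List[int]] = {"test": test, "dev": dev}
    for fold in range(val_n_splits):
        # Each fold's validation set is a different slice of the dev set.
        val = dev[fold::val_n_splits]
        val_set = set(val)
        splits["train_fold_{}".format(fold)] = [x for x in dev if x not in val_set]
        splits["val_fold_{}".format(fold)] = val
    return splits


single = single_test_sketch(list(range(50)))
# One test set of 10 items, a dev set of 40, and 5 train/validation pairs.
```

Here the test set is fixed once and never appears in any training or validation fold.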
Source code in aimet_ml/model_selection/splits.py
stratified_group_split(dataset_df, test_fraction=0.2, stratify_cols=None, group_cols=None, random_seed=1414)
Split a dataset into development and test sets with stratification and grouping.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`dataset_df` | `DataFrame` | The input DataFrame to be split. | required |
`test_fraction` | `Union[float, int]` | The fraction of data to be used for testing. If a float in (0, 1) is given, it is rounded to the nearest fraction. If an integer (n > 1) is given, the fraction is calculated as 1/n. Defaults to 0.2. | `0.2` |
`stratify_cols` | `Collection[str]` | Column names for stratification. Defaults to None. | `None` |
`group_cols` | `Collection[str]` | Column names for grouping. Defaults to None. | `None` |
`random_seed` | `int` | Random seed for reproducibility. Defaults to 1414. | `1414` |
Returns:
Type | Description |
---|---|
`Tuple[DataFrame, DataFrame]` | A tuple containing the development and test DataFrames. |
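
The `test_fraction` rule accepts either a float or an integer, which is easy to misread. One possible reading, offered as an assumption rather than the library's confirmed behaviour, is that a float is snapped to the nearest unit fraction 1/n, while an integer n is taken as 1/n directly. `normalize_fraction` is a hypothetical helper.

```python
from fractions import Fraction
from typing import Union


def normalize_fraction(value: Union[float, int]) -> Fraction:
    """Convert a float in (0, 1) or an integer n > 1 into a fraction 1/n."""
    if isinstance(value, int):
        if value <= 1:
            raise ValueError("an integer value must be greater than 1")
        return Fraction(1, value)
    if not 0 < value < 1:
        raise ValueError("a float value must lie strictly between 0 and 1")
    # Assumed rounding rule: pick the n closest to 1 / value.
    return Fraction(1, round(1 / value))


fifth = normalize_fraction(0.2)   # Fraction(1, 5)
third = normalize_fraction(0.3)   # 1 / 0.3 is about 3.33, so n = 3
quarter = normalize_fraction(4)   # an integer n means 1/n
```

Under this reading, a float such as 0.3 does not yield exactly 30% of the data but the closest 1/n share, here one third.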
Source code in aimet_ml/model_selection/splits.py