diff --git a/common-primitives/HISTORY.md b/common-primitives/HISTORY.md
deleted file mode 100644
index 5daa8a3..0000000
--- a/common-primitives/HISTORY.md
+++ /dev/null
@@ -1,363 +0,0 @@
-## v0.8.0
-
-* Removed multi-targets support in `classification.light_gbm.Common` and fixed
-  categorical attributes handling.
-  [!118](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/118)
-* Unified date parsing across primitives.
-  Added `raise_error` hyper-parameter to `data_preprocessing.datetime_range_filter.Common`.
-  This bumped the version of the primitive.
-  [!117](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/117)
-* `evaluation.kfold_time_series_split.Common` now parses the datetime column
-  before sorting. A `fuzzy_time_parsing` hyper-parameter was added to the primitive.
-  This bumped the version of the primitive.
-  [!110](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/110)
-* Added option `equal` to hyper-parameter `match_logic` of primitive
-  `data_transformation.extract_columns_by_semantic_types.Common` to support set equality
-  when determining columns to extract. This bumped the version of the primitive.
-  [!116](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/116)
-* Fixed `data_preprocessing.one_hot_encoder.MakerCommon` to work with the
-  latest core package.
-* `data_cleaning.tabular_extractor.Common` has been fixed to work with the
-  latest version of sklearn.
-  [!113](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/113)
-* The ISI side of `data_augmentation.datamart_augmentation.Common` and
-  `data_augmentation.datamart_download.Common` has been updated.
-  [!108](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/108)
-* Improved how pipelines and pipeline runs for all primitives are managed.
-  Many more pipelines and pipeline runs were added.
-* `evaluation.kfold_timeseries_split.Common` has been renamed to `evaluation.kfold_time_series_split.Common`.
-* Fixed `data_preprocessing.dataset_sample.Common` on empty input.
-  [!95](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/95)
-* `data_preprocessing.datetime_range_filter.Common` no longer assumes the local timezone
-  when parsing dates.
-  [#115](https://gitlab.com/datadrivendiscovery/common-primitives/issues/115)
-* Added `fuzzy_time_parsing` hyper-parameter to `data_transformation.column_parser.Common`.
-  This bumped the version of the primitive.
-* Fixed `data_transformation.column_parser.Common` to work correctly with `python-dateutil==2.8.1`.
-  [#119](https://gitlab.com/datadrivendiscovery/common-primitives/issues/119)
-* Refactored `data_preprocessing.one_hot_encoder.MakerCommon` to address some issues.
-  [#66](https://gitlab.com/datadrivendiscovery/common-primitives/issues/66)
-  [#75](https://gitlab.com/datadrivendiscovery/common-primitives/issues/75)
-  [!96](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/96)
-* Added support for handling of numeric columns to `data_preprocessing.regex_filter.Common` and `data_preprocessing.term_filter.Common`.
-  [!101](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/101)
-  [!104](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/104)
-* Fixed an exception in the `produce` method of `data_transformation.datetime_field_compose.Common` caused by using an incorrect type for the dataframe indexer.
-  [!102](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/102)
-* Added primitives:
-  * `data_transformation.grouping_field_compose.Common`
-
-## v0.7.0
-
-* Renamed primitives:
-  * `data_transformation.add_semantic_types.DataFrameCommon` to `data_transformation.add_semantic_types.Common`
-  * `data_transformation.remove_semantic_types.DataFrameCommon` to `data_transformation.remove_semantic_types.Common`
-  * `data_transformation.replace_semantic_types.DataFrameCommon` to `data_transformation.replace_semantic_types.Common`
-  * `operator.column_map.DataFrameCommon` to `operator.column_map.Common`
-  * `regression.xgboost_gbtree.DataFrameCommon` to `regression.xgboost_gbtree.Common`
-  * `classification.light_gbm.DataFrameCommon` to `classification.light_gbm.Common`
-  * `classification.xgboost_gbtree.DataFrameCommon` to `classification.xgboost_gbtree.Common`
-  * `classification.xgboost_dart.DataFrameCommon` to `classification.xgboost_dart.Common`
-  * `classification.random_forest.DataFrameCommon` to `classification.random_forest.Common`
-  * `data_transformation.extract_columns.DataFrameCommon` to `data_transformation.extract_columns.Common`
-  * `data_transformation.extract_columns_by_semantic_types.DataFrameCommon` to `data_transformation.extract_columns_by_semantic_types.Common`
-  * `data_transformation.extract_columns_by_structural_types.DataFrameCommon` to `data_transformation.extract_columns_by_structural_types.Common`
-  * `data_transformation.cut_audio.DataFrameCommon` to `data_transformation.cut_audio.Common`
-  * `data_transformation.column_parser.DataFrameCommon` to `data_transformation.column_parser.Common`
-  * `data_transformation.remove_columns.DataFrameCommon` to `data_transformation.remove_columns.Common`
-  * `data_transformation.remove_duplicate_columns.DataFrameCommon` to `data_transformation.remove_duplicate_columns.Common`
-  * `data_transformation.horizontal_concat.DataFrameConcat` to `data_transformation.horizontal_concat.DataFrameCommon`
-  * `data_transformation.construct_predictions.DataFrameCommon` to `data_transformation.construct_predictions.Common`
-  * `data_transformation.datetime_field_compose.DataFrameCommon` to `data_transformation.datetime_field_compose.Common`
-  * `data_preprocessing.label_encoder.DataFrameCommon` to `data_preprocessing.label_encoder.Common`
-  * `data_preprocessing.label_decoder.DataFrameCommon` to `data_preprocessing.label_decoder.Common`
-  * `data_preprocessing.image_reader.DataFrameCommon` to `data_preprocessing.image_reader.Common`
-  * `data_preprocessing.text_reader.DataFrameCommon` to `data_preprocessing.text_reader.Common`
-  * `data_preprocessing.video_reader.DataFrameCommon` to `data_preprocessing.video_reader.Common`
-  * `data_preprocessing.csv_reader.DataFrameCommon` to `data_preprocessing.csv_reader.Common`
-  * `data_preprocessing.audio_reader.DataFrameCommon` to `data_preprocessing.audio_reader.Common`
-  * `data_preprocessing.regex_filter.DataFrameCommon` to `data_preprocessing.regex_filter.Common`
-  * `data_preprocessing.term_filter.DataFrameCommon` to `data_preprocessing.term_filter.Common`
-  * `data_preprocessing.numeric_range_filter.DataFrameCommon` to `data_preprocessing.numeric_range_filter.Common`
-  * `data_preprocessing.datetime_range_filter.DataFrameCommon` to `data_preprocessing.datetime_range_filter.Common`
-
-## v0.6.0
-
-* Added `match_logic`, `negate`, and `add_index_columns` hyper-parameters
-  to `data_transformation.extract_columns_by_structural_types.DataFrameCommon`
-  and `data_transformation.extract_columns_by_semantic_types.DataFrameCommon`
-  primitives.
-* `feature_extraction.sparse_pca.Common` has been removed and is now available as part of realML.
-  [!89](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/89)
-* Added new primitives:
-  * `data_preprocessing.datetime_range_filter.DataFrameCommon`
-  * `data_transformation.datetime_field_compose.DataFrameCommon`
-  * `d3m.primitives.data_preprocessing.flatten.DataFrameCommon`
-  * `data_augmentation.datamart_augmentation.Common`
-  * `data_augmentation.datamart_download.Common`
-  * `data_preprocessing.dataset_sample.Common`
-
-  [#53](https://gitlab.com/datadrivendiscovery/common-primitives/issues/53)
-  [!86](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/86)
-  [!87](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/87)
-  [!85](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/85)
-  [!63](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/63)
-  [!92](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/92)
-  [!93](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/93)
-  [!81](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/81)
-
-* Fixed the `fit` method to return the correct value for `operator.column_map.DataFrameCommon`,
-  `operator.dataset_map.DataFrameCommon`, and `schema_discovery.profiler.Common`.
-* Some unmaintained primitives have been disabled. If you are using them, consider adopting them.
-  * `classification.bayesian_logistic_regression.Common`
-  * `regression.convolutional_neural_net.TorchCommon`
-  * `operator.diagonal_mvn.Common`
-  * `regression.feed_forward_neural_net.TorchCommon`
-  * `data_preprocessing.image_reader.Common`
-  * `clustering.k_means.Common`
-  * `regression.linear_regression.Common`
-  * `regression.loss.TorchCommon`
-  * `feature_extraction.pca.Common`
-* `data_transformation.update_semantic_types.DatasetCommon` has been removed.
-  Use `data_transformation.add_semantic_types.DataFrameCommon`,
-  `data_transformation.remove_semantic_types.DataFrameCommon`,
-  or `data_transformation.replace_semantic_types.DataFrameCommon` together with
-  the `operator.dataset_map.DataFrameCommon` primitive to obtain the previous functionality.
-  [#83](https://gitlab.com/datadrivendiscovery/common-primitives/issues/83)
-* `data_transformation.remove_columns.DatasetCommon` has been removed.
-  Use `data_transformation.remove_columns.DataFrameCommon` together with
-  the `operator.dataset_map.DataFrameCommon` primitive to obtain the previous functionality.
-  [#83](https://gitlab.com/datadrivendiscovery/common-primitives/issues/83)
-* Some primitives which operate on a Dataset have been converted to operate
-  on a DataFrame and renamed. Use them together with the `operator.dataset_map.DataFrameCommon`
-  primitive to obtain the previous functionality.
-  * `data_preprocessing.regex_filter.DatasetCommon` to `data_preprocessing.regex_filter.DataFrameCommon`
-  * `data_preprocessing.term_filter.DatasetCommon` to `data_preprocessing.term_filter.DataFrameCommon`
-  * `data_preprocessing.numeric_range_filter.DatasetCommon` to `data_preprocessing.numeric_range_filter.DataFrameCommon`
-
-  [#83](https://gitlab.com/datadrivendiscovery/common-primitives/issues/83)
-  [!84](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/84)
-
-* `schema_discovery.profiler.Common` has been improved:
-  * More options added to `detect_semantic_types`.
-  * Added a new `remove_unknown_type` hyper-parameter.
-
-## v0.5.0
-
-* `evaluation.compute_scores.Common` primitive has been moved to the core
-  package and renamed to `evaluation.compute_scores.Core`.
-* `metafeature_extraction.compute_metafeatures.Common` has been renamed to
-  `metalearning.metafeature_extractor.Common`.
-* `evaluation.compute_scores.Common` now has an `add_normalized_scores` hyper-parameter
-  which controls whether a column with normalized scores is also added to the output;
-  it is added by default.
-* `data_preprocessing.text_reader.DataFrameCommon` primitive has been fixed.
-* `data_transformation.rename_duplicate_name.DataFrameCommon` primitive was
-  fixed to handle all types of column names.
-  [#73](https://gitlab.com/datadrivendiscovery/common-primitives/issues/73)
-  [!65](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/65)
-* Added new primitives:
-  * `data_cleaning.tabular_extractor.Common`
-  * `data_preprocessing.one_hot_encoder.PandasCommon`
-  * `schema_discovery.profiler.Common`
-  * `data_transformation.ravel.DataFrameRowCommon`
-  * `operator.column_map.DataFrameCommon`
-  * `operator.dataset_map.DataFrameCommon`
-  * `data_transformation.normalize_column_references.Common`
-  * `data_transformation.normalize_graphs.Common`
-  * `feature_extraction.sparse_pca.Common`
-  * `evaluation.kfold_timeseries_split.Common`
-
-  [#57](https://gitlab.com/datadrivendiscovery/common-primitives/issues/57)
-  [!42](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/42)
-  [!44](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/44)
-  [!47](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/47)
-  [!71](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/71)
-  [!73](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/73)
-  [!77](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/77)
-  [!66](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/66)
-  [!67](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/67)
-
-* Added hyper-parameter `error_on_no_columns` to `classification.random_forest.DataFrameCommon`.
-* Common primitives have been updated to the latest changes in the `d3m` core package.
-* Many utility functions from `utils.py` have been moved to the `d3m` core package.
-
-## v0.4.0
-
-* Renamed `data_preprocessing.one_hot_encoder.Common` to
-  `data_preprocessing.one_hot_encoder.MakerCommon` and reimplemented it.
-  [!54](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/54)
-* Added new primitives:
-  * `classification.xgboost_gbtree.DataFrameCommon`
-  * `classification.xgboost_dart.DataFrameCommon`
-  * `regression.xgboost_gbtree.DataFrameCommon`
-  * `classification.light_gbm.DataFrameCommon`
-  * `data_transformation.rename_duplicate_name.DataFrameCommon`
-
-  [!45](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/45)
-  [!46](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/46)
-  [!49](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/49)
-
-* Made sure `utils.select_columns` works also when given a tuple of columns, and not just a list.
-  [!58](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/58)
-* `classification.random_forest.DataFrameCommon` updated so that produced columns have
-  names matching column names seen during fitting. Moreover, `produce_feature_importances`
-  now returns a `DataFrame` with each column being one feature and having one row with
-  importances.
-  [!59](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/59)
-* `regression.feed_forward_neural_net.TorchCommon` updated to support
-  selection of columns using semantic types.
-  [!57](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/57)
-
-## v0.3.0
-
-* Made `evaluation.redact_columns.Common` primitive more general so that it can
-  redact any columns based on their semantic type and not just targets.
-* Renamed primitives:
-  * `datasets.Denormalize` to `data_transformation.denormalize.Common`
-  * `datasets.DatasetToDataFrame` to `data_transformation.dataset_to_dataframe.Common`
-  * `evaluation.ComputeScores` to `evaluation.compute_scores.Common`
-  * `evaluation.RedactTargets` to `evaluation.redact_columns.Common`
-  * `evaluation.KFoldDatasetSplit` to `evaluation.kfold_dataset_split.Common`
-  * `evaluation.TrainScoreDatasetSplit` to `evaluation.train_score_dataset_split.Common`
-  * `evaluation.NoSplitDatasetSplit` to `evaluation.no_split_dataset_split.Common`
-  * `evaluation.FixedSplitDatasetSplit` to `evaluation.fixed_split_dataset_split.Commmon`
-  * `classifier.RandomForest` to `classification.random_forest.DataFrameCommon`
-  * `metadata.ComputeMetafeatures` to `metafeature_extraction.compute_metafeatures.Common`
-  * `audio.CutAudio` to `data_transformation.cut_audio.DataFrameCommon`
-  * `data.ListToNDArray` to `data_transformation.list_to_ndarray.Common`
-  * `data.StackNDArrayColumn` to `data_transformation.stack_ndarray_column.Common`
-  * `data.AddSemanticTypes` to `data_transformation.add_semantic_types.DataFrameCommon`
-  * `data.RemoveSemanticTypes` to `data_transformation.remove_semantic_types.DataFrameCommon`
-  * `data.ConstructPredictions` to `data_transformation.construct_predictions.DataFrameCommon`
-  * `data.ColumnParser` to `data_transformation.column_parser.DataFrameCommon`
-  * `data.CastToType` to `data_transformation.cast_to_type.Common`
-  * `data.ExtractColumns` to `data_transformation.extract_columns.DataFrameCommon`
-  * `data.ExtractColumnsBySemanticTypes` to `data_transformation.extract_columns_by_semantic_types.DataFrameCommon`
-  * `data.ExtractColumnsByStructuralTypes` to `data_transformation.extract_columns_by_structural_types.DataFrameCommon`
-  * `data.RemoveColumns` to `data_transformation.remove_columns.DataFrameCommon`
-  * `data.RemoveDuplicateColumns` to `data_transformation.remove_duplicate_columns.DataFrameCommon`
-  * `data.HorizontalConcat` to `data_transformation.horizontal_concat.DataFrameConcat`
-  * `data.DataFrameToNDArray` to `data_transformation.dataframe_to_ndarray.Common`
-  * `data.NDArrayToDataFrame` to `data_transformation.ndarray_to_dataframe.Common`
-  * `data.DataFrameToList` to `data_transformation.dataframe_to_list.Common`
-  * `data.ListToDataFrame` to `data_transformation.list_to_dataframe.Common`
-  * `data.NDArrayToList` to `data_transformation.ndarray_to_list.Common`
-  * `data.ReplaceSemanticTypes` to `data_transformation.replace_semantic_types.DataFrameCommon`
-  * `data.UnseenLabelEncoder` to `data_preprocessing.label_encoder.DataFrameCommon`
-  * `data.UnseenLabelDecoder` to `data_preprocessing.label_decoder.DataFrameCommon`
-  * `data.ImageReader` to `data_preprocessing.image_reader.DataFrameCommon`
-  * `data.TextReader` to `data_preprocessing.text_reader.DataFrameCommon`
-  * `data.VideoReader` to `data_preprocessing.video_reader.DataFrameCommon`
-  * `data.CSVReader` to `data_preprocessing.csv_reader.DataFrameCommon`
-  * `data.AudioReader` to `data_preprocessing.audio_reader.DataFrameCommon`
-  * `datasets.UpdateSemanticTypes` to `data_transformation.update_semantic_types.DatasetCommon`
-  * `datasets.RemoveColumns` to `data_transformation.remove_columns.DatasetCommon`
-  * `datasets.RegexFilter` to `data_preprocessing.regex_filter.DatasetCommon`
-  * `datasets.TermFilter` to `data_preprocessing.term_filter.DatasetCommon`
-  * `datasets.NumericRangeFilter` to `data_preprocessing.numeric_range_filter.DatasetCommon`
-  * `common_primitives.BayesianLogisticRegression` to `classification.bayesian_logistic_regression.Common`
-  * `common_primitives.ConvolutionalNeuralNet` to `regression.convolutional_neural_net.TorchCommon`
-  * `common_primitives.DiagonalMVN` to `operator.diagonal_mvn.Common`
-  * `common_primitives.FeedForwardNeuralNet` to `regression.feed_forward_neural_net.TorchCommon`
-  * `common_primitives.ImageReader` to `data_preprocessing.image_reader.Common`
-  * `common_primitives.KMeans` to `clustering.kmeans.Common`
-  * `common_primitives.LinearRegression` to `regression.linear_regression.Common`
-  * `common_primitives.Loss` to `regression.loss.TorchCommon`
-  * `common_primitives.PCA` to `feature_extraction.pca.Common`
-  * `common_primitives.OneHotMaker` to `data_preprocessing.one_hot_encoder.Common`
-* Fixed a pickling issue of `classifier.RandomForest`.
-  [#47](https://gitlab.com/datadrivendiscovery/common-primitives/issues/47)
-  [!48](https://gitlab.com/datadrivendiscovery/common-primitives/merge_requests/48)
-* `data.ColumnParser` primitive now has an additional hyper-parameter `replace_index_columns`
-  which controls whether index columns are still replaced when returned parsed
-  columns are otherwise appended.
-* Made `data.RemoveDuplicateColumns` fit and remember duplicate columns during training.
-  [#45](https://gitlab.com/datadrivendiscovery/common-primitives/issues/45)
-* Added `match_logic` hyper-parameter to the `data.ReplaceSemanticTypes` primitive
-  which allows one to control how multiple specified semantic types match.
-* Added new primitives:
-  * `metadata.ComputeMetafeatures`
-  * `datasets.RegexFilter`
-  * `datasets.TermFilter`
-  * `datasets.NumericRangeFilter`
-  * `evaluation.NoSplitDatasetSplit`
-  * `evaluation.FixedSplitDatasetSplit`
-* The column parser was fixed to parse columns with the `http://schema.org/DateTime` semantic type.
-* Simplified (and made more predictable) the logic of the `combine_columns` utility
-  function when `return_result` is set to `new` and `add_index_columns` is set to true.
-  Now, if the output already contains any index column, input index columns are not added;
-  and if there are no index columns, all input index columns are added at the beginning.
-* Fixed `_can_use_inputs_column` in `classifier.RandomForest`. Added a check of the structural
-  type, so that only columns with numerical structural types are processed.
-* Correctly set column names in `evaluation.ComputeScores` primitive's output.
-* Cast indices and columns to match predicted columns' dtypes.
-  [#33](https://gitlab.com/datadrivendiscovery/common-primitives/issues/33)
-* `datasets.DatasetToDataFrame` primitive does not try to generate metadata automatically
-  because this is not really needed (metadata can just be copied from the dataset). This
-  speeds up the primitive.
-  [#34](https://gitlab.com/datadrivendiscovery/common-primitives/issues/34)
-* Made it uniform that whenever we generate lists of all column names,
-  we first try to get the name from the metadata and fall back to the one
-  in the DataFrame, instead of using a column index in the latter case.
-* Made splitting primitives, `classifier.RandomForest`, and `data.UnseenLabelEncoder`
-  picklable even when unfitted.
-* Fixed the entry point for the `audio.CutAudio` primitive.
-
-## v0.2.0
-
-* Made the primitives listed below operate on semantic types and support different ways to return results.
-* Added or updated many primitives:
-  * `data.ExtractColumns`
-  * `data.ExtractColumnsBySemanticTypes`
-  * `data.ExtractColumnsByStructuralTypes`
-  * `data.RemoveColumns`
-  * `data.RemoveDuplicateColumns`
-  * `data.HorizontalConcat`
-  * `data.CastToType`
-  * `data.ColumnParser`
-  * `data.ConstructPredictions`
-  * `data.DataFrameToNDArray`
-  * `data.NDArrayToDataFrame`
-  * `data.DataFrameToList`
-  * `data.ListToDataFrame`
-  * `data.NDArrayToList`
-  * `data.ListToNDArray`
-  * `data.StackNDArrayColumn`
-  * `data.AddSemanticTypes`
-  * `data.RemoveSemanticTypes`
-  * `data.ReplaceSemanticTypes`
-  * `data.UnseenLabelEncoder`
-  * `data.UnseenLabelDecoder`
-  * `data.ImageReader`
-  * `data.TextReader`
-  * `data.VideoReader`
-  * `data.CSVReader`
-  * `data.AudioReader`
-  * `datasets.Denormalize`
-  * `datasets.DatasetToDataFrame`
-  * `datasets.UpdateSemanticTypes`
-  * `datasets.RemoveColumns`
-  * `evaluation.RedactTargets`
-  * `evaluation.ComputeScores`
-  * `evaluation.KFoldDatasetSplit`
-  * `evaluation.TrainScoreDatasetSplit`
-  * `audio.CutAudio`
-  * `classifier.RandomForest`
-* Started listing enabled primitives in the [`entry_points.ini`](./entry_points.ini) file.
-* Created `devel` branch which contains primitives coded against the
-  future release of the `d3m` core package (its `devel` branch).
-  The `master` branch of this repository is made against the latest stable
-  release of the `d3m` core package.
-* Dropped support for Python 2.7; Python 3.6 is now required.
-* Renamed the repository and package to `common-primitives` and `common_primitives`,
-  respectively.
-* Repository migrated to gitlab.com and made public.
-
-## v0.1.1
-
-* Made common primitives work on Python 2.7.
-
-## v0.1.0
-
-* Initial set of common primitives.
diff --git a/common-primitives/HOW_TO_MANAGE.md b/common-primitives/HOW_TO_MANAGE.md
deleted file mode 100644
index 9e0d3db..0000000
--- a/common-primitives/HOW_TO_MANAGE.md
+++ /dev/null
@@ -1,94 +0,0 @@
-# How to publish primitive annotations
-
-As contributors add or update their primitives they might want to publish
-primitive annotations for the added primitives. When doing this it is important
-to also republish all other primitive annotations already published from this
-package. This is because only one version of the package can be installed at
-a time and all primitive annotations have to point to the same package in
-their `installation` metadata.
-
-Steps to publish primitive annotations:
-* Operate in a virtual env with the following installed:
-  * The target core package.
-  * [Test primitives](https://gitlab.com/datadrivendiscovery/tests-data/tree/master/primitives)
-    at the same version as the primitives currently published in the `primitives`
-    repository. Remember to install them in `-e` (editable) mode.
-* Update `HISTORY.md` for the `vNEXT` release with information about primitives
-  added or updated. If there has been no package release since they were last updated,
-  do not duplicate entries; just update any existing entries for those primitives
-  instead, so that once released it is clear what has changed in the release as a whole.
-* Make sure tests for primitives being published (primitives added, updated,
-  and primitives previously published which should now be republished) pass.
-* Update `entry_points.ini` and add new primitives. Leave active
-  only those entries for primitives being (re)published and comment out all others.
-  * If this is the first time primitives are published after a release of a new `d3m`
-    core package, leave active only those which were updated to work with
-    the new `d3m` core package. Leave it to others to update, verify, and publish
-    the other common primitives.
-* In a clone of the `primitives` repository, prepare a branch off the up-to-date `master`
-  branch to add/update primitive annotations. If existing annotations for common primitives
-  are already there, it is best to first remove them to make sure annotations for
-  removed primitives do not stay around. All primitives will be re-added in the next step.
-* Run `add.sh` in the root of this package, which will add primitive annotations
-  to `primitives`. See instructions in the script for more information.
-* Verify the changes in `primitives`, then add and commit the files to git.
-* Publish the branch in `primitives` and make a merge request.
-
-# How to release a new version
-
-A new version is always released from the `master` branch against a stable release
-of the `d3m` core package. A new version should be released when there are major
-changes to the package (many new primitives added, larger breaking changes).
-Sync up with other developers of the repository to suggest a release, or do a release.
-A condensed shell sketch of these steps is shown after the list below.
-
-* On `master` branch:
-  * Make sure the `HISTORY.md` file is updated with all changes since the last release.
-  * Change the version in `common_primitives/__init__.py` to the to-be-released version, without the `v` prefix.
-  * Change `vNEXT` in `HISTORY.md` to the to-be-released version, with the `v` prefix.
-  * Commit with message `Bumping version for release.`
-  * `git push`
-  * Wait for CI to run tests successfully.
-  * Tag with version prefixed with `v`, e.g., for version `0.2.0`: `git tag v0.2.0`
-  * `git push` & `git push --tags`
-  * Change the version in `common_primitives/__init__.py` back to the `devel` string.
-  * Add a new empty `vNEXT` version on top of `HISTORY.md`.
-  * Commit with message `Version bump for development.`
-  * `git push`
-* On `devel` branch:
-  * Merge `master` into `devel` branch: `git merge master`
-  * Update the branch according to the section below.
-  * `git push`
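For quick reference, the release sequence above condenses into the following shell sketch. It reuses version `0.2.0` from the tagging example above; the edits to `common_primitives/__init__.py` and `HISTORY.md` are done by hand before each commit:

```
# On master, after setting the version to 0.2.0 in
# common_primitives/__init__.py and HISTORY.md:
$ git commit -a -m "Bumping version for release."
$ git push
# Wait for CI to pass, then tag the release:
$ git tag v0.2.0
$ git push && git push --tags
# Back to development state (version set back to "devel",
# new empty vNEXT entry added on top of HISTORY.md):
$ git commit -a -m "Version bump for development."
$ git push
# Keep the devel branch in sync:
$ git checkout devel
$ git merge master
$ git push
```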
-
-# How to update `master` branch after a release of a new `d3m` core package
-
-Hopefully, the `devel` branch already contains code which works against the released
-`d3m` core package. So merge the `devel` branch into the `master` branch and update
-files according to the following section.
-
-# Keeping `master` and `devel` branches in sync
-
-Because `master` and `devel` branches mostly contain the same code,
-just made against different versions of the `d3m` core package, it is common
-to merge branches into each other as needed to keep them in sync.
-When doing so, the following files are specific to each branch:
-
-* `.gitlab-ci.yml` has a `DEPENDENCY_REF` environment variable which
-  has to point to `master` on the `master` branch of this repository,
-  and to `devel` on the `devel` branch of this repository.
-
-# How to add an example pipeline
-
-Every common primitive (except those used in non-standard pipelines, like splitting primitives)
-should have at least one example pipeline and an associated pipeline run.
-
-Add example pipelines into the corresponding sub-directory (based on the primitive's suffix)
-of the `pipelines` directory in the repository. If a pipeline uses multiple common primitives,
-add it for only one primitive and create symbolic links for the other primitives.
-
-Create a `fit-score` pipeline run as [described in the primitives index repository](https://gitlab.com/datadrivendiscovery/primitives#adding-a-primitive).
-Compress it with `gzip` and store it under the `pipeline_runs` directory in the repository.
-Similarly, add it only for one primitive and create symbolic links for the others, if the
-pipeline run corresponds to a pipeline with multiple common primitives.
-
-Use the `git-add.sh` script to ensure that all files larger than 100 KB are added as git LFS
-files to the repository. A sketch of these steps is shown below.
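As an illustration, here is a minimal shell sketch, assuming a hypothetical pipeline run saved as `pipeline_run.yml` for a pipeline which uses both `data_transformation.column_parser.Common` and `data_transformation.extract_columns.Common` (the primitive names are chosen for illustration only; the directory layout follows the existing files under `pipeline_runs`):

```
# Compress the pipeline run and store it under one primitive:
$ gzip pipeline_run.yml
$ mkdir -p pipeline_runs/data_transformation.column_parser.Common
$ mv pipeline_run.yml.gz pipeline_runs/data_transformation.column_parser.Common/
# The pipeline uses a second common primitive, so add a symbolic link instead of a copy:
$ mkdir -p pipeline_runs/data_transformation.extract_columns.Common
$ ln -s ../data_transformation.column_parser.Common/pipeline_run.yml.gz \
    pipeline_runs/data_transformation.extract_columns.Common/
# Track files larger than 100 KB with git LFS before committing:
$ ./git-add.sh
```

The relative symlink target mirrors the pattern used by the existing files under `pipeline_runs`, and `git-add.sh` makes sure the compressed run is tracked with git LFS if it exceeds 100 KB.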
- - "Contribution" shall mean any work of authorship, including - the original version of the Work and any modifications or additions - to that Work or Derivative Works thereof, that is intentionally - submitted to Licensor for inclusion in the Work by the copyright owner - or by an individual or Legal Entity authorized to submit on behalf of - the copyright owner. For the purposes of this definition, "submitted" - means any form of electronic, verbal, or written communication sent - to the Licensor or its representatives, including but not limited to - communication on electronic mailing lists, source code control systems, - and issue tracking systems that are managed by, or on behalf of, the - Licensor for the purpose of discussing and improving the Work, but - excluding communication that is conspicuously marked or otherwise - designated in writing by the copyright owner as "Not a Contribution." - - "Contributor" shall mean Licensor and any individual or Legal Entity - on behalf of whom a Contribution has been received by Licensor and - subsequently incorporated within the Work. - - 2. Grant of Copyright License. Subject to the terms and conditions of - this License, each Contributor hereby grants to You a perpetual, - worldwide, non-exclusive, no-charge, royalty-free, irrevocable - copyright license to reproduce, prepare Derivative Works of, - publicly display, publicly perform, sublicense, and distribute the - Work and such Derivative Works in Source or Object form. - - 3. Grant of Patent License. Subject to the terms and conditions of - this License, each Contributor hereby grants to You a perpetual, - worldwide, non-exclusive, no-charge, royalty-free, irrevocable - (except as stated in this section) patent license to make, have made, - use, offer to sell, sell, import, and otherwise transfer the Work, - where such license applies only to those patent claims licensable - by such Contributor that are necessarily infringed by their - Contribution(s) alone or by combination of their Contribution(s) - with the Work to which such Contribution(s) was submitted. If You - institute patent litigation against any entity (including a - cross-claim or counterclaim in a lawsuit) alleging that the Work - or a Contribution incorporated within the Work constitutes direct - or contributory patent infringement, then any patent licenses - granted to You under this License for that Work shall terminate - as of the date such litigation is filed. - - 4. Redistribution. 
You may reproduce and distribute copies of the - Work or Derivative Works thereof in any medium, with or without - modifications, and in Source or Object form, provided that You - meet the following conditions: - - (a) You must give any other recipients of the Work or - Derivative Works a copy of this License; and - - (b) You must cause any modified files to carry prominent notices - stating that You changed the files; and - - (c) You must retain, in the Source form of any Derivative Works - that You distribute, all copyright, patent, trademark, and - attribution notices from the Source form of the Work, - excluding those notices that do not pertain to any part of - the Derivative Works; and - - (d) If the Work includes a "NOTICE" text file as part of its - distribution, then any Derivative Works that You distribute must - include a readable copy of the attribution notices contained - within such NOTICE file, excluding those notices that do not - pertain to any part of the Derivative Works, in at least one - of the following places: within a NOTICE text file distributed - as part of the Derivative Works; within the Source form or - documentation, if provided along with the Derivative Works; or, - within a display generated by the Derivative Works, if and - wherever such third-party notices normally appear. The contents - of the NOTICE file are for informational purposes only and - do not modify the License. You may add Your own attribution - notices within Derivative Works that You distribute, alongside - or as an addendum to the NOTICE text from the Work, provided - that such additional attribution notices cannot be construed - as modifying the License. - - You may add Your own copyright statement to Your modifications and - may provide additional or different license terms and conditions - for use, reproduction, or distribution of Your modifications, or - for any such Derivative Works as a whole, provided Your use, - reproduction, and distribution of the Work otherwise complies with - the conditions stated in this License. - - 5. Submission of Contributions. Unless You explicitly state otherwise, - any Contribution intentionally submitted for inclusion in the Work - by You to the Licensor shall be under the terms and conditions of - this License, without any additional terms or conditions. - Notwithstanding the above, nothing herein shall supersede or modify - the terms of any separate license agreement you may have executed - with Licensor regarding such Contributions. - - 6. Trademarks. This License does not grant permission to use the trade - names, trademarks, service marks, or product names of the Licensor, - except as required for reasonable and customary use in describing the - origin of the Work and reproducing the content of the NOTICE file. - - 7. Disclaimer of Warranty. Unless required by applicable law or - agreed to in writing, Licensor provides the Work (and each - Contributor provides its Contributions) on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - implied, including, without limitation, any warranties or conditions - of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A - PARTICULAR PURPOSE. You are solely responsible for determining the - appropriateness of using or redistributing the Work and assume any - risks associated with Your exercise of permissions under this License. - - 8. Limitation of Liability. 
In no event and under no legal theory, - whether in tort (including negligence), contract, or otherwise, - unless required by applicable law (such as deliberate and grossly - negligent acts) or agreed to in writing, shall any Contributor be - liable to You for damages, including any direct, indirect, special, - incidental, or consequential damages of any character arising as a - result of this License or out of the use or inability to use the - Work (including but not limited to damages for loss of goodwill, - work stoppage, computer failure or malfunction, or any and all - other commercial damages or losses), even if such Contributor - has been advised of the possibility of such damages. - - 9. Accepting Warranty or Additional Liability. While redistributing - the Work or Derivative Works thereof, You may choose to offer, - and charge a fee for, acceptance of support, warranty, indemnity, - or other liability obligations and/or rights consistent with this - License. However, in accepting such obligations, You may act only - on Your own behalf and on Your sole responsibility, not on behalf - of any other Contributor, and only if You agree to indemnify, - defend, and hold each Contributor harmless for any liability - incurred by, or claims asserted against, such Contributor by reason - of your accepting any such warranty or additional liability. - - END OF TERMS AND CONDITIONS - - APPENDIX: How to apply the Apache License to your work. - - To apply the Apache License to your work, attach the following - boilerplate notice, with the fields enclosed by brackets "[]" - replaced with your own identifying information. (Don't include - the brackets!) The text should be enclosed in the appropriate - comment syntax for the file format. We also recommend that a - file or class name and description of purpose be included on the - same "printed page" as the copyright notice for easier - identification within third-party archives. - - Copyright [yyyy] [name of copyright owner] - - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. diff --git a/common-primitives/MANIFEST.in b/common-primitives/MANIFEST.in deleted file mode 100644 index 3e677d0..0000000 --- a/common-primitives/MANIFEST.in +++ /dev/null @@ -1,2 +0,0 @@ -include README.md -include LICENSE.txt diff --git a/common-primitives/README.md b/common-primitives/README.md deleted file mode 100644 index fe2fbcf..0000000 --- a/common-primitives/README.md +++ /dev/null @@ -1,83 +0,0 @@ -# Common D3M primitives - -A common set of primitives for D3M project, maintained together. -It contains example primitives, various glue primitives, and other primitives performers -contributed. - -## Installation - -This package works on Python 3.6+ and pip 19+. 
diff --git a/common-primitives/README.md b/common-primitives/README.md
deleted file mode 100644
index fe2fbcf..0000000
--- a/common-primitives/README.md
+++ /dev/null
@@ -1,83 +0,0 @@
-# Common D3M primitives
-
-A common set of primitives for the D3M project, maintained together.
-It contains example primitives, various glue primitives, and other
-primitives contributed by performers.
-
-## Installation
-
-This package works on Python 3.6+ and pip 19+.
-
-This package has additional dependencies which are specified in primitives' metadata,
-but if you are installing the package manually, you first have to run, on Ubuntu:
-
-```
-$ apt-get install build-essential libopenblas-dev libcap-dev ffmpeg
-$ pip3 install python-prctl
-```
-
-To install common primitives from inside a cloned repository, run:
-
-```
-$ pip3 install -e .
-```
-
-When cloning the repository, clone it recursively to also get git submodules:
-
-```
-$ git clone --recursive https://gitlab.com/datadrivendiscovery/common-primitives.git
-```
-
-## Changelog
-
-See [HISTORY.md](./HISTORY.md) for a summary of changes to this package.
-
-## Repository structure
-
-The `master` branch contains the latest code of common primitives made against the latest stable
-release of the [`d3m` core package](https://gitlab.com/datadrivendiscovery/d3m) (its `master` branch).
-The `devel` branch contains the latest code of common primitives made against the
-future release of the `d3m` core package (its `devel` branch).
-
-Releases are [tagged](https://gitlab.com/datadrivendiscovery/d3m/tags) but they are not done
-regularly. Each primitive has its own versions as well, which are not related to package versions.
-Generally it is best to just use the latest code available in the `master` or `devel`
-branch (depending on which version of the core package you are using).
-
-## Testing locally
-
-For each commit to this repository, tests run automatically in the
-[GitLab CI](https://gitlab.com/datadrivendiscovery/common-primitives/pipelines).
-
-If you don't want to wait for the GitLab CI test results, you can run the tests locally
-by installing and using the [GitLab runner](https://docs.gitlab.com/runner/install/) on your system.
-
-With the local GitLab runner, you can run the tests defined in the [.gitlab-ci.yml](.gitlab-ci.yml)
-file of this repository, such as:
-
-```
-$ gitlab-runner exec docker style_check
-$ gitlab-runner exec docker type_check
-```
-
-You can also run the tests available under `/tests` directly:
-
-```
-$ python3 run_tests.py
-```
-
-## Contribute
-
-Feel free to contribute more primitives to this repository. The idea is to build
-a common set of primitives which serve both as examples and as a way to share
-maintenance of some primitives, especially glue primitives.
-
-All primitives are written in Python 3 and are type checked using
-[mypy](http://www.mypy-lang.org/), so typing annotations are required.
-
-## About Data Driven Discovery Program
-
-The DARPA Data Driven Discovery (D3M) program is researching ways to get machines to build
-machine learning pipelines automatically. It is split into three layers:
-TA1 (primitives), TA2 (systems which automatically combine primitives into pipelines
-and execute them), and TA3 (end-user interfaces).
diff --git a/common-primitives/add.sh b/common-primitives/add.sh
deleted file mode 100755
index 7059b16..0000000
--- a/common-primitives/add.sh
+++ /dev/null
@@ -1,24 +0,0 @@
-#!/bin/bash -e
-
-# The assumption is that this repository is cloned into a "common-primitives" directory
-# which is a sibling of the "d3m-primitives" directory with D3M public primitives.
-
-D3M_VERSION="$(python3 -c 'import d3m; print(d3m.__version__)')"
-
-for PRIMITIVE_SUFFIX in $(./list_primitives.py --suffix); do
-    echo "$PRIMITIVE_SUFFIX"
-    python3 -m d3m index describe -i 4 "d3m.primitives.$PRIMITIVE_SUFFIX" > primitive.json
-    pushd ../d3m-primitives > /dev/null
-    ./add.py ../common-primitives/primitive.json
-    popd > /dev/null
-    if [[ -e "pipelines/$PRIMITIVE_SUFFIX" ]]; then
-        PRIMITIVE_PATH="$(echo ../d3m-primitives/v$D3M_VERSION/common-primitives/d3m.primitives.$PRIMITIVE_SUFFIX/*)"
-        mkdir -p "$PRIMITIVE_PATH/pipelines"
-        find pipelines/$PRIMITIVE_SUFFIX/ \( -name '*.json' -or -name '*.yaml' -or -name '*.yml' -or -name '*.json.gz' -or -name '*.yaml.gz' -or -name '*.yml.gz' \) -exec cp '{}' "$PRIMITIVE_PATH/pipelines" ';'
-    fi
-    if [[ -e "pipeline_runs/$PRIMITIVE_SUFFIX" ]]; then
-        PRIMITIVE_PATH="$(echo ../d3m-primitives/v$D3M_VERSION/common-primitives/d3m.primitives.$PRIMITIVE_SUFFIX/*)"
-        mkdir -p "$PRIMITIVE_PATH/pipeline_runs"
-        find pipeline_runs/$PRIMITIVE_SUFFIX/ \( -name '*.yml.gz' -or -name '*.yaml.gz' \) -exec cp '{}' "$PRIMITIVE_PATH/pipeline_runs" ';'
-    fi
-done
diff --git a/common-primitives/entry_points.ini b/common-primitives/entry_points.ini
deleted file mode 100644
index 5dac201..0000000
--- a/common-primitives/entry_points.ini
+++ /dev/null
@@ -1,63 +0,0 @@
-[d3m.primitives]
-data_preprocessing.one_hot_encoder.MakerCommon = common_primitives.one_hot_maker:OneHotMakerPrimitive
-data_preprocessing.one_hot_encoder.PandasCommon = common_primitives.pandas_onehot_encoder:PandasOneHotEncoderPrimitive
-data_transformation.extract_columns.Common = common_primitives.extract_columns:ExtractColumnsPrimitive
-data_transformation.extract_columns_by_semantic_types.Common = common_primitives.extract_columns_semantic_types:ExtractColumnsBySemanticTypesPrimitive
-data_transformation.extract_columns_by_structural_types.Common = common_primitives.extract_columns_structural_types:ExtractColumnsByStructuralTypesPrimitive
-data_transformation.remove_columns.Common = common_primitives.remove_columns:RemoveColumnsPrimitive
-data_transformation.remove_duplicate_columns.Common = common_primitives.remove_duplicate_columns:RemoveDuplicateColumnsPrimitive
-data_transformation.horizontal_concat.DataFrameCommon = common_primitives.horizontal_concat:HorizontalConcatPrimitive
-data_transformation.cast_to_type.Common = common_primitives.cast_to_type:CastToTypePrimitive
-data_transformation.column_parser.Common = common_primitives.column_parser:ColumnParserPrimitive
-data_transformation.construct_predictions.Common = common_primitives.construct_predictions:ConstructPredictionsPrimitive
-data_transformation.dataframe_to_ndarray.Common = common_primitives.dataframe_to_ndarray:DataFrameToNDArrayPrimitive
-data_transformation.ndarray_to_dataframe.Common = common_primitives.ndarray_to_dataframe:NDArrayToDataFramePrimitive
-data_transformation.dataframe_to_list.Common = common_primitives.dataframe_to_list:DataFrameToListPrimitive
-data_transformation.list_to_dataframe.Common = common_primitives.list_to_dataframe:ListToDataFramePrimitive
-data_transformation.ndarray_to_list.Common = common_primitives.ndarray_to_list:NDArrayToListPrimitive
-data_transformation.list_to_ndarray.Common = common_primitives.list_to_ndarray:ListToNDArrayPrimitive
-data_transformation.stack_ndarray_column.Common = common_primitives.stack_ndarray_column:StackNDArrayColumnPrimitive
-data_transformation.add_semantic_types.Common = common_primitives.add_semantic_types:AddSemanticTypesPrimitive
-data_transformation.remove_semantic_types.Common = common_primitives.remove_semantic_types:RemoveSemanticTypesPrimitive
-data_transformation.replace_semantic_types.Common = common_primitives.replace_semantic_types:ReplaceSemanticTypesPrimitive
-data_transformation.denormalize.Common = common_primitives.denormalize:DenormalizePrimitive
-data_transformation.datetime_field_compose.Common = common_primitives.datetime_field_compose:DatetimeFieldComposePrimitive
-data_transformation.grouping_field_compose.Common = common_primitives.grouping_field_compose:GroupingFieldComposePrimitive
-data_transformation.dataset_to_dataframe.Common = common_primitives.dataset_to_dataframe:DatasetToDataFramePrimitive
-data_transformation.cut_audio.Common = common_primitives.cut_audio:CutAudioPrimitive
-data_transformation.rename_duplicate_name.DataFrameCommon = common_primitives.rename_duplicate_columns:RenameDuplicateColumnsPrimitive
-#data_transformation.normalize_column_references.Common = common_primitives.normalize_column_references:NormalizeColumnReferencesPrimitive
-#data_transformation.normalize_graphs.Common = common_primitives.normalize_graphs:NormalizeGraphsPrimitive
-data_transformation.ravel.DataFrameRowCommon = common_primitives.ravel:RavelAsRowPrimitive
-data_preprocessing.label_encoder.Common = common_primitives.unseen_label_encoder:UnseenLabelEncoderPrimitive
-data_preprocessing.label_decoder.Common = common_primitives.unseen_label_decoder:UnseenLabelDecoderPrimitive
-data_preprocessing.image_reader.Common = common_primitives.dataframe_image_reader:DataFrameImageReaderPrimitive
-data_preprocessing.text_reader.Common = common_primitives.text_reader:TextReaderPrimitive
-data_preprocessing.video_reader.Common = common_primitives.video_reader:VideoReaderPrimitive
-data_preprocessing.csv_reader.Common = common_primitives.csv_reader:CSVReaderPrimitive
-data_preprocessing.audio_reader.Common = common_primitives.audio_reader:AudioReaderPrimitive
-data_preprocessing.regex_filter.Common = common_primitives.regex_filter:RegexFilterPrimitive
-data_preprocessing.term_filter.Common = common_primitives.term_filter:TermFilterPrimitive
-data_preprocessing.numeric_range_filter.Common = common_primitives.numeric_range_filter:NumericRangeFilterPrimitive
-data_preprocessing.datetime_range_filter.Common = common_primitives.datetime_range_filter:DatetimeRangeFilterPrimitive
-data_preprocessing.dataset_sample.Common = common_primitives.dataset_sample:DatasetSamplePrimitive
-#data_preprocessing.time_interval_transform.Common = common_primitives.time_interval_transform:TimeIntervalTransformPrimitive
-data_cleaning.tabular_extractor.Common = common_primitives.tabular_extractor:AnnotatedTabularExtractorPrimitive
-evaluation.redact_columns.Common = common_primitives.redact_columns:RedactColumnsPrimitive
-evaluation.kfold_dataset_split.Common = common_primitives.kfold_split:KFoldDatasetSplitPrimitive
-evaluation.kfold_time_series_split.Common = common_primitives.kfold_split_timeseries:KFoldTimeSeriesSplitPrimitive
-evaluation.train_score_dataset_split.Common = common_primitives.train_score_split:TrainScoreDatasetSplitPrimitive
-evaluation.no_split_dataset_split.Common = common_primitives.no_split:NoSplitDatasetSplitPrimitive
-evaluation.fixed_split_dataset_split.Commmon = common_primitives.fixed_split:FixedSplitDatasetSplitPrimitive
-classification.random_forest.Common = common_primitives.random_forest:RandomForestClassifierPrimitive
-classification.light_gbm.Common = common_primitives.lgbm_classifier:LightGBMClassifierPrimitive
-classification.xgboost_gbtree.Common = common_primitives.xgboost_gbtree:XGBoostGBTreeClassifierPrimitive
-classification.xgboost_dart.Common = common_primitives.xgboost_dart:XGBoostDartClassifierPrimitive
-regression.xgboost_gbtree.Common = common_primitives.xgboost_regressor:XGBoostGBTreeRegressorPrimitive
-schema_discovery.profiler.Common = common_primitives.simple_profiler:SimpleProfilerPrimitive
-operator.column_map.Common = common_primitives.column_map:DataFrameColumnMapPrimitive
-operator.dataset_map.DataFrameCommon = common_primitives.dataset_map:DataFrameDatasetMapPrimitive
-data_preprocessing.flatten.DataFrameCommon = common_primitives.dataframe_flatten:DataFrameFlattenPrimitive
-metalearning.metafeature_extractor.Common = common_primitives.compute_metafeatures:ComputeMetafeaturesPrimitive
-data_augmentation.datamart_augmentation.Common = common_primitives.datamart_augment:DataMartAugmentPrimitive
-data_augmentation.datamart_download.Common = common_primitives.datamart_download:DataMartDownloadPrimitive
diff --git a/common-primitives/git-add.sh b/common-primitives/git-add.sh
deleted file mode 100755
index 896ab85..0000000
--- a/common-primitives/git-add.sh
+++ /dev/null
@@ -1,5 +0,0 @@
-#!/bin/bash -e
-
-# This requires git LFS 2.9.0 or newer.
-
-find * -type f -size +100k -exec git lfs track --filename '{}' +
diff --git a/common-primitives/git-check.sh b/common-primitives/git-check.sh
deleted file mode 100755
index 8a6b468..0000000
--- a/common-primitives/git-check.sh
+++ /dev/null
@@ -1,21 +0,0 @@
-#!/bin/bash -e
-
-if git rev-list --objects --all \
-| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
-| sed -n 's/^blob //p' \
-| awk '$2 >= 100*(2^10)' \
-| awk '{print $3}' \
-| egrep -v '(^|/).gitattributes$' ; then
-  echo "Repository contains committed objects larger than 100 KB."
-  exit 1
-fi
-
-if git lfs ls-files --name-only | xargs -r stat -c '%s %n' | awk '$1 < 100*(2^10)' | awk '{print $2}' | grep . ; then
-  echo "Repository contains LFS objects smaller than 100 KB."
-  exit 1
-fi
-
-if git lfs ls-files --name-only | xargs -r stat -c '%s %n' | awk '$1 >= 2*(2^30)' | awk '{print $2}' | grep . ; then
-  echo "Repository contains LFS objects not smaller than 2 GB."
- exit 1 -fi diff --git a/common-primitives/list_primitives.py b/common-primitives/list_primitives.py deleted file mode 100755 index 0d5da96..0000000 --- a/common-primitives/list_primitives.py +++ /dev/null @@ -1,32 +0,0 @@ -#!/usr/bin/env python3 - -import argparse -import configparser -import re - - -class CaseSensitiveConfigParser(configparser.ConfigParser): - optionxform = staticmethod(str) - - -parser = argparse.ArgumentParser(description='List enabled common primitives.') -group = parser.add_mutually_exclusive_group(required=True) -group.add_argument('--suffix', action='store_true', help='list primitive suffixes of all enabled common primitives') -group.add_argument('--python', action='store_true', help='list Python paths of all enabled common primitives') -group.add_argument('--files', action='store_true', help='list file paths of all enabled common primitives') - -args = parser.parse_args() - -entry_points = CaseSensitiveConfigParser() -entry_points.read('entry_points.ini') - -for primitive_suffix, primitive_path in entry_points.items('d3m.primitives'): - if args.python: - print("d3m.primitives.{primitive_suffix}".format(primitive_suffix=primitive_suffix)) - elif args.suffix: - print(primitive_suffix) - elif args.files: - primitive_path = re.sub(':.+$', '', primitive_path) - primitive_path = re.sub('\.', '/', primitive_path) - print("{primitive_path}.py".format(primitive_path=primitive_path)) - diff --git a/common-primitives/pipeline_runs/classification.light_gbm.DataFrameCommon/1.yaml.gz b/common-primitives/pipeline_runs/classification.light_gbm.DataFrameCommon/1.yaml.gz deleted file mode 100644 index 0529242..0000000 Binary files a/common-primitives/pipeline_runs/classification.light_gbm.DataFrameCommon/1.yaml.gz and /dev/null differ diff --git a/common-primitives/pipeline_runs/classification.random_forest.DataFrameCommon/1.yaml.gz b/common-primitives/pipeline_runs/classification.random_forest.DataFrameCommon/1.yaml.gz deleted file mode 100644 index b742e77..0000000 Binary files a/common-primitives/pipeline_runs/classification.random_forest.DataFrameCommon/1.yaml.gz and /dev/null differ diff --git a/common-primitives/pipeline_runs/classification.random_forest.DataFrameCommon/pipeline_run_extract_structural_types.yml.gz b/common-primitives/pipeline_runs/classification.random_forest.DataFrameCommon/pipeline_run_extract_structural_types.yml.gz deleted file mode 120000 index 91f49b3..0000000 --- a/common-primitives/pipeline_runs/classification.random_forest.DataFrameCommon/pipeline_run_extract_structural_types.yml.gz +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.extract_columns_by_structural_types.Common/pipeline_run.yml.gz \ No newline at end of file diff --git a/common-primitives/pipeline_runs/classification.xgboost_dart.DataFrameCommon/1.yaml.gz b/common-primitives/pipeline_runs/classification.xgboost_dart.DataFrameCommon/1.yaml.gz deleted file mode 100644 index 144a9cf..0000000 Binary files a/common-primitives/pipeline_runs/classification.xgboost_dart.DataFrameCommon/1.yaml.gz and /dev/null differ diff --git a/common-primitives/pipeline_runs/classification.xgboost_gbtree.DataFrameCommon/1.yaml.gz b/common-primitives/pipeline_runs/classification.xgboost_gbtree.DataFrameCommon/1.yaml.gz deleted file mode 100644 index 1bc8198..0000000 Binary files a/common-primitives/pipeline_runs/classification.xgboost_gbtree.DataFrameCommon/1.yaml.gz and /dev/null differ diff --git a/common-primitives/pipeline_runs/data_augmentation.datamart_augmentation.Common/2.yaml.gz 
b/common-primitives/pipeline_runs/data_augmentation.datamart_augmentation.Common/2.yaml.gz deleted file mode 100644 index e449db8..0000000 Binary files a/common-primitives/pipeline_runs/data_augmentation.datamart_augmentation.Common/2.yaml.gz and /dev/null differ diff --git a/common-primitives/pipeline_runs/data_preprocessing.dataset_sample.Common/1.yaml.gz b/common-primitives/pipeline_runs/data_preprocessing.dataset_sample.Common/1.yaml.gz deleted file mode 100644 index 824ea25..0000000 Binary files a/common-primitives/pipeline_runs/data_preprocessing.dataset_sample.Common/1.yaml.gz and /dev/null differ diff --git a/common-primitives/pipeline_runs/data_preprocessing.one_hot_encoder.PandasCommon/pipeline_run_extract_structural_types.yml.gz b/common-primitives/pipeline_runs/data_preprocessing.one_hot_encoder.PandasCommon/pipeline_run_extract_structural_types.yml.gz deleted file mode 120000 index 91f49b3..0000000 --- a/common-primitives/pipeline_runs/data_preprocessing.one_hot_encoder.PandasCommon/pipeline_run_extract_structural_types.yml.gz +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.extract_columns_by_structural_types.Common/pipeline_run.yml.gz \ No newline at end of file diff --git a/common-primitives/pipeline_runs/data_transformation.column_parser.DataFrameCommon/1.yaml.gz b/common-primitives/pipeline_runs/data_transformation.column_parser.DataFrameCommon/1.yaml.gz deleted file mode 120000 index b531842..0000000 --- a/common-primitives/pipeline_runs/data_transformation.column_parser.DataFrameCommon/1.yaml.gz +++ /dev/null @@ -1 +0,0 @@ -../classification.light_gbm.DataFrameCommon/1.yaml.gz \ No newline at end of file diff --git a/common-primitives/pipeline_runs/data_transformation.column_parser.DataFrameCommon/pipeline_run_extract_structural_types.yml.gz b/common-primitives/pipeline_runs/data_transformation.column_parser.DataFrameCommon/pipeline_run_extract_structural_types.yml.gz deleted file mode 120000 index 91f49b3..0000000 --- a/common-primitives/pipeline_runs/data_transformation.column_parser.DataFrameCommon/pipeline_run_extract_structural_types.yml.gz +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.extract_columns_by_structural_types.Common/pipeline_run.yml.gz \ No newline at end of file diff --git a/common-primitives/pipeline_runs/data_transformation.column_parser.DataFrameCommon/pipeline_run_group_field_compose.yml.gz b/common-primitives/pipeline_runs/data_transformation.column_parser.DataFrameCommon/pipeline_run_group_field_compose.yml.gz deleted file mode 120000 index 0a4dd35..0000000 --- a/common-primitives/pipeline_runs/data_transformation.column_parser.DataFrameCommon/pipeline_run_group_field_compose.yml.gz +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.grouping_field_compose.Common/pipeline_run.yml.gz \ No newline at end of file diff --git a/common-primitives/pipeline_runs/data_transformation.construct_predictions.DataFrameCommon/1.yaml.gz b/common-primitives/pipeline_runs/data_transformation.construct_predictions.DataFrameCommon/1.yaml.gz deleted file mode 120000 index b531842..0000000 --- a/common-primitives/pipeline_runs/data_transformation.construct_predictions.DataFrameCommon/1.yaml.gz +++ /dev/null @@ -1 +0,0 @@ -../classification.light_gbm.DataFrameCommon/1.yaml.gz \ No newline at end of file diff --git a/common-primitives/pipeline_runs/data_transformation.construct_predictions.DataFrameCommon/pipeline_run_extract_structural_types.yml.gz 
diff --git a/common-primitives/pipeline_runs/data_transformation.construct_predictions.DataFrameCommon/pipeline_run_extract_structural_types.yml.gz b/common-primitives/pipeline_runs/data_transformation.construct_predictions.DataFrameCommon/pipeline_run_extract_structural_types.yml.gz
deleted file mode 120000
index 91f49b3..0000000
--- a/common-primitives/pipeline_runs/data_transformation.construct_predictions.DataFrameCommon/pipeline_run_extract_structural_types.yml.gz
+++ /dev/null
@@ -1 +0,0 @@
-../data_transformation.extract_columns_by_structural_types.Common/pipeline_run.yml.gz
\ No newline at end of file
diff --git a/common-primitives/pipeline_runs/data_transformation.dataset_to_dataframe.Common/1.yaml.gz b/common-primitives/pipeline_runs/data_transformation.dataset_to_dataframe.Common/1.yaml.gz
deleted file mode 120000
index b531842..0000000
--- a/common-primitives/pipeline_runs/data_transformation.dataset_to_dataframe.Common/1.yaml.gz
+++ /dev/null
@@ -1 +0,0 @@
-../classification.light_gbm.DataFrameCommon/1.yaml.gz
\ No newline at end of file
diff --git a/common-primitives/pipeline_runs/data_transformation.dataset_to_dataframe.Common/pipeline_run_extract_structural_types.yml.gz b/common-primitives/pipeline_runs/data_transformation.dataset_to_dataframe.Common/pipeline_run_extract_structural_types.yml.gz
deleted file mode 120000
index 91f49b3..0000000
--- a/common-primitives/pipeline_runs/data_transformation.dataset_to_dataframe.Common/pipeline_run_extract_structural_types.yml.gz
+++ /dev/null
@@ -1 +0,0 @@
-../data_transformation.extract_columns_by_structural_types.Common/pipeline_run.yml.gz
\ No newline at end of file
diff --git a/common-primitives/pipeline_runs/data_transformation.dataset_to_dataframe.Common/pipeline_run_group_field_compose.yml.gz b/common-primitives/pipeline_runs/data_transformation.dataset_to_dataframe.Common/pipeline_run_group_field_compose.yml.gz
deleted file mode 120000
index 0a4dd35..0000000
--- a/common-primitives/pipeline_runs/data_transformation.dataset_to_dataframe.Common/pipeline_run_group_field_compose.yml.gz
+++ /dev/null
@@ -1 +0,0 @@
-../data_transformation.grouping_field_compose.Common/pipeline_run.yml.gz
\ No newline at end of file
diff --git a/common-primitives/pipeline_runs/data_transformation.extract_columns_by_semantic_types.DataFrameCommon/1.yaml.gz b/common-primitives/pipeline_runs/data_transformation.extract_columns_by_semantic_types.DataFrameCommon/1.yaml.gz
deleted file mode 120000
index b531842..0000000
--- a/common-primitives/pipeline_runs/data_transformation.extract_columns_by_semantic_types.DataFrameCommon/1.yaml.gz
+++ /dev/null
@@ -1 +0,0 @@
-../classification.light_gbm.DataFrameCommon/1.yaml.gz
\ No newline at end of file
diff --git a/common-primitives/pipeline_runs/data_transformation.extract_columns_by_semantic_types.DataFrameCommon/pipeline_run_extract_structural_types.yml.gz b/common-primitives/pipeline_runs/data_transformation.extract_columns_by_semantic_types.DataFrameCommon/pipeline_run_extract_structural_types.yml.gz
deleted file mode 120000
index 91f49b3..0000000
--- a/common-primitives/pipeline_runs/data_transformation.extract_columns_by_semantic_types.DataFrameCommon/pipeline_run_extract_structural_types.yml.gz
+++ /dev/null
@@ -1 +0,0 @@
-../data_transformation.extract_columns_by_structural_types.Common/pipeline_run.yml.gz
\ No newline at end of file
diff --git a/common-primitives/pipeline_runs/data_transformation.extract_columns_by_structural_types.Common/pipeline_run.yml.gz b/common-primitives/pipeline_runs/data_transformation.extract_columns_by_structural_types.Common/pipeline_run.yml.gz
deleted file mode 100644
index 3e1ee3c..0000000
Binary files
a/common-primitives/pipeline_runs/data_transformation.extract_columns_by_structural_types.Common/pipeline_run.yml.gz and /dev/null differ diff --git a/common-primitives/pipeline_runs/data_transformation.grouping_field_compose.Common/pipeline_run.yml.gz b/common-primitives/pipeline_runs/data_transformation.grouping_field_compose.Common/pipeline_run.yml.gz deleted file mode 100644 index 8f8bdf0..0000000 Binary files a/common-primitives/pipeline_runs/data_transformation.grouping_field_compose.Common/pipeline_run.yml.gz and /dev/null differ diff --git a/common-primitives/pipeline_runs/data_transformation.horizontal_concat.DataFrameConcat/1.yaml.gz b/common-primitives/pipeline_runs/data_transformation.horizontal_concat.DataFrameConcat/1.yaml.gz deleted file mode 120000 index cc4d8fa..0000000 --- a/common-primitives/pipeline_runs/data_transformation.horizontal_concat.DataFrameConcat/1.yaml.gz +++ /dev/null @@ -1 +0,0 @@ -../data_preprocessing.one_hot_encoder.MakerCommon/1.yaml.gz \ No newline at end of file diff --git a/common-primitives/pipeline_runs/data_transformation.remove_columns.Common/pipeline_run_extract_structural_types.yml.gz b/common-primitives/pipeline_runs/data_transformation.remove_columns.Common/pipeline_run_extract_structural_types.yml.gz deleted file mode 120000 index 91f49b3..0000000 --- a/common-primitives/pipeline_runs/data_transformation.remove_columns.Common/pipeline_run_extract_structural_types.yml.gz +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.extract_columns_by_structural_types.Common/pipeline_run.yml.gz \ No newline at end of file diff --git a/common-primitives/pipeline_runs/regression.xgboost_gbtree.DataFrameCommon/1.yml b/common-primitives/pipeline_runs/regression.xgboost_gbtree.DataFrameCommon/1.yml deleted file mode 100644 index 6484df1..0000000 --- a/common-primitives/pipeline_runs/regression.xgboost_gbtree.DataFrameCommon/1.yml +++ /dev/null @@ -1,4729 +0,0 @@ -context: TESTING -datasets: -- digest: 36ab84076f277efd634abdabc2d094b09e079aa2c99253d433647c062972171b - id: 26_radon_seed_dataset_TRAIN -end: '2019-06-19T09:54:49.049658Z' -environment: - engine_version: 2019.6.7 - id: 6fdad0c4-dcb1-541d-a2a7-d2d9590c26dd - reference_engine_version: 2019.6.7 - resources: - cpu: - constraints: - cpu_shares: 1024 - devices: - - name: Intel Core Processor (Broadwell) - - name: Intel Core Processor (Broadwell) - - name: Intel Core Processor (Broadwell) - - name: Intel Core Processor (Broadwell) - - name: Intel Core Processor (Broadwell) - - name: Intel Core Processor (Broadwell) - - name: Intel Core Processor (Broadwell) - - name: Intel Core Processor (Broadwell) - logical_present: 8 - physical_present: 8 - memory: - total_memory: 25281880064 - worker_id: f5ffb488-a883-509d-ad4f-5e819913cb14 -id: 67cc5311-80d6-5f1f-96f6-7f7b9d1308a0 -pipeline: - digest: 79e82bee63a0db3dd75d41de0caa2054b860a8211fa9895a061fdb8f9f1c444d - id: 0f636602-6299-411b-9873-4b974cd393ba -problem: - digest: 01ab113ff802b57fe872f7b4e4422789921d033c41bc8ad0bd6e0d041291ed6f - id: 26_radon_seed_problem -random_seed: 0 -run: - phase: FIT - results: - predictions: - header: - - d3mIndex - - log_radon - values: - - - 0 - - 1 - - 2 - - 3 - - 4 - - 6 - - 7 - - 8 - - 9 - - 10 - - 11 - - 12 - - 13 - - 14 - - 15 - - 16 - - 17 - - 18 - - 19 - - 20 - - 21 - - 22 - - 24 - - 25 - - 26 - - 27 - - 28 - - 29 - - 32 - - 34 - - 35 - - 36 - - 37 - - 38 - - 40 - - 41 - - 42 - - 43 - - 45 - - 46 - - 47 - - 48 - - 50 - - 51 - - 52 - - 53 - - 55 - - 56 - - 57 - - 58 - - 59 - - 61 - - 62 - - 64 - - 68 - - 69 - - 71 - - 73 
- - 74 - - 75 - - 79 - - 80 - - 81 - - 83 - - 84 - - 85 - - 87 - - 88 - - 89 - - 90 - - 91 - - 92 - - 93 - - 94 - - 95 - - 98 - - 99 - - 100 - - 101 - - 102 - - 103 - - 104 - - 105 - - 106 - - 108 - - 111 - - 112 - - 113 - - 114 - - 115 - - 116 - - 117 - - 118 - - 119 - - 121 - - 122 - - 123 - - 124 - - 125 - - 126 - - 127 - - 128 - - 129 - - 130 - - 131 - - 132 - - 133 - - 134 - - 135 - - 138 - - 140 - - 142 - - 143 - - 144 - - 145 - - 146 - - 147 - - 148 - - 149 - - 150 - - 151 - - 152 - - 153 - - 154 - - 155 - - 156 - - 157 - - 158 - - 159 - - 160 - - 161 - - 162 - - 163 - - 164 - - 166 - - 167 - - 169 - - 170 - - 171 - - 172 - - 173 - - 175 - - 176 - - 177 - - 178 - - 179 - - 180 - - 181 - - 182 - - 183 - - 184 - - 185 - - 186 - - 187 - - 188 - - 189 - - 190 - - 191 - - 193 - - 194 - - 195 - - 196 - - 197 - - 199 - - 200 - - 201 - - 202 - - 203 - - 204 - - 205 - - 206 - - 207 - - 211 - - 212 - - 214 - - 216 - - 217 - - 219 - - 220 - - 221 - - 222 - - 223 - - 224 - - 225 - - 226 - - 228 - - 229 - - 230 - - 232 - - 233 - - 234 - - 236 - - 237 - - 238 - - 240 - - 241 - - 242 - - 243 - - 245 - - 246 - - 247 - - 248 - - 249 - - 251 - - 252 - - 253 - - 255 - - 256 - - 257 - - 258 - - 260 - - 262 - - 263 - - 264 - - 267 - - 268 - - 269 - - 270 - - 271 - - 272 - - 273 - - 274 - - 276 - - 277 - - 278 - - 279 - - 280 - - 282 - - 283 - - 284 - - 285 - - 287 - - 288 - - 289 - - 291 - - 292 - - 293 - - 294 - - 295 - - 297 - - 298 - - 301 - - 302 - - 303 - - 304 - - 307 - - 308 - - 310 - - 312 - - 313 - - 315 - - 316 - - 317 - - 318 - - 319 - - 320 - - 322 - - 324 - - 325 - - 327 - - 328 - - 329 - - 330 - - 335 - - 336 - - 337 - - 338 - - 339 - - 340 - - 341 - - 342 - - 343 - - 345 - - 346 - - 347 - - 348 - - 349 - - 352 - - 353 - - 354 - - 355 - - 356 - - 358 - - 361 - - 362 - - 363 - - 364 - - 366 - - 367 - - 368 - - 369 - - 370 - - 371 - - 372 - - 373 - - 374 - - 375 - - 376 - - 378 - - 379 - - 382 - - 383 - - 384 - - 385 - - 386 - - 387 - - 388 - - 389 - - 390 - - 391 - - 392 - - 393 - - 395 - - 396 - - 397 - - 398 - - 399 - - 400 - - 401 - - 402 - - 403 - - 404 - - 405 - - 406 - - 407 - - 409 - - 410 - - 411 - - 412 - - 413 - - 414 - - 415 - - 418 - - 419 - - 420 - - 421 - - 422 - - 423 - - 424 - - 425 - - 426 - - 427 - - 428 - - 431 - - 432 - - 434 - - 435 - - 437 - - 438 - - 441 - - 442 - - 443 - - 444 - - 445 - - 446 - - 447 - - 449 - - 450 - - 451 - - 452 - - 453 - - 454 - - 455 - - 456 - - 457 - - 458 - - 459 - - 460 - - 461 - - 462 - - 463 - - 465 - - 466 - - 467 - - 469 - - 470 - - 471 - - 472 - - 473 - - 474 - - 475 - - 476 - - 477 - - 478 - - 479 - - 480 - - 481 - - 482 - - 484 - - 485 - - 487 - - 488 - - 489 - - 491 - - 492 - - 494 - - 495 - - 496 - - 498 - - 499 - - 500 - - 502 - - 503 - - 504 - - 505 - - 506 - - 507 - - 508 - - 509 - - 510 - - 511 - - 512 - - 513 - - 515 - - 516 - - 517 - - 518 - - 519 - - 520 - - 521 - - 522 - - 524 - - 525 - - 528 - - 529 - - 531 - - 532 - - 534 - - 535 - - 537 - - 538 - - 540 - - 541 - - 543 - - 544 - - 546 - - 547 - - 548 - - 550 - - 553 - - 555 - - 556 - - 557 - - 559 - - 560 - - 561 - - 562 - - 563 - - 564 - - 565 - - 566 - - 567 - - 568 - - 569 - - 570 - - 571 - - 572 - - 573 - - 574 - - 575 - - 576 - - 577 - - 578 - - 579 - - 580 - - 581 - - 582 - - 583 - - 586 - - 587 - - 588 - - 589 - - 590 - - 592 - - 593 - - 596 - - 600 - - 602 - - 603 - - 604 - - 605 - - 606 - - 607 - - 608 - - 609 - - 610 - - 612 - - 613 - - 614 - - 615 - - 618 - - 619 - - 620 - - 621 - - 622 - - 623 - - 624 - - 625 - - 626 - - 627 - - 628 - - 629 - - 630 - - 631 - - 
632 - - 634 - - 636 - - 637 - - 639 - - 640 - - 641 - - 642 - - 643 - - 644 - - 645 - - 646 - - 647 - - 648 - - 649 - - 650 - - 651 - - 652 - - 653 - - 654 - - 657 - - 658 - - 659 - - 660 - - 661 - - 662 - - 663 - - 664 - - 665 - - 667 - - 670 - - 671 - - 672 - - 673 - - 674 - - 675 - - 676 - - 677 - - 678 - - 679 - - 681 - - 682 - - 683 - - 684 - - 686 - - 689 - - 690 - - 691 - - 692 - - 693 - - 694 - - 695 - - 696 - - 697 - - 698 - - 699 - - 700 - - 701 - - 702 - - 705 - - 707 - - 709 - - 711 - - 712 - - 713 - - 716 - - 717 - - 718 - - 719 - - 720 - - 721 - - 722 - - 724 - - 725 - - 726 - - 727 - - 728 - - 729 - - 730 - - 731 - - 732 - - 733 - - 735 - - 736 - - 737 - - 738 - - 740 - - 741 - - 742 - - 743 - - 744 - - 746 - - 747 - - 748 - - 749 - - 750 - - 751 - - 752 - - 754 - - 756 - - 757 - - 758 - - 761 - - 762 - - 763 - - 764 - - 765 - - 766 - - 768 - - 769 - - 770 - - 771 - - 772 - - 773 - - 774 - - 775 - - 776 - - 777 - - 778 - - 779 - - 780 - - 781 - - 782 - - 783 - - 784 - - 785 - - 786 - - 787 - - 789 - - 791 - - 792 - - 796 - - 797 - - 799 - - 800 - - 801 - - 802 - - 803 - - 804 - - 805 - - 806 - - 807 - - 810 - - 811 - - 812 - - 813 - - 815 - - 816 - - 818 - - 819 - - 820 - - 821 - - 822 - - 824 - - 825 - - 826 - - 828 - - 829 - - 830 - - 831 - - 832 - - 833 - - 835 - - 836 - - 837 - - 838 - - 839 - - 840 - - 841 - - 842 - - 843 - - 844 - - 845 - - 846 - - 847 - - 849 - - 850 - - 851 - - 852 - - 853 - - 854 - - 855 - - 856 - - 857 - - 858 - - 859 - - 860 - - 861 - - 863 - - 865 - - 866 - - 867 - - 868 - - 870 - - 871 - - 873 - - 874 - - 875 - - 876 - - 877 - - 878 - - 881 - - 883 - - 884 - - 885 - - 886 - - 887 - - 888 - - 889 - - 890 - - 891 - - 892 - - 894 - - 895 - - 896 - - 898 - - 899 - - 900 - - 901 - - 902 - - 903 - - 905 - - 906 - - 907 - - 909 - - 910 - - 911 - - 912 - - 915 - - 916 - - 917 - - 918 - - - 0.8330342769622803 - - 0.8330861330032349 - - 1.098989725112915 - - 0.0957428514957428 - - 1.1612895727157593 - - 0.4703049063682556 - - 0.09569096565246582 - - -0.22225826978683472 - - 0.26259732246398926 - - 0.2628304660320282 - - 0.3366249203681946 - - 0.40562954545021057 - - -0.6918290853500366 - - 0.18271362781524658 - - 1.522369146347046 - - 0.3367617130279541 - - 0.7891305685043335 - - 1.7914975881576538 - - 1.2189394235610962 - - 0.6415701508522034 - - 1.7050760984420776 - - 1.853300929069519 - - 1.8837662935256958 - - 1.161426305770874 - - 1.931780457496643 - - 1.974898338317871 - - 2.062042713165283 - - 1.665402889251709 - - 1.0651249885559082 - - 0.528777003288269 - - 1.4584600925445557 - - 1.704390048980713 - - 1.4109606742858887 - - 0.8739264011383057 - - 0.4055776596069336 - - 1.2182621955871582 - - 1.0990341901779175 - - 0.6417514085769653 - - 0.917670488357544 - - 0.1827174723148346 - - 0.8330342769622803 - - -0.3561854958534241 - - 1.098937749862671 - - 0.8333100080490112 - - 0.5901779532432556 - - 0.4056333899497986 - - 0.641793966293335 - - 0.2626492381095886 - - 1.4767398834228516 - - 1.523003339767456 - - 1.8526148796081543 - - 1.7553712129592896 - - 0.8330342769622803 - - 1.54994797706604 - - 1.098989725112915 - - 1.098989725112915 - - 1.6289032697677612 - - 1.6289032697677612 - - 2.5725338459014893 - - 1.979255199432373 - - 2.264638900756836 - - 1.8087071180343628 - - 1.360558271408081 - - 0.6414333581924438 - - 1.9379489421844482 - - 1.5710643529891968 - - 0.955110490322113 - - 1.9247326850891113 - - 1.4115099906921387 - - 2.3232131004333496 - - 0.8329493999481201 - - 0.641518235206604 - - 1.2587052583694458 - - 1.7441229820251465 - - 
1.4767398834228516 - - 1.4584600925445557 - - -0.10479313135147095 - - 0.7409548759460449 - - 0.5288288593292236 - - 2.570702075958252 - - 2.694930076599121 - - 1.570378303527832 - - 2.273859977722168 - - -2.29718017578125 - - 2.0223045349121094 - - 1.4111419916152954 - - 2.062042713165283 - - 0.40562954545021057 - - 2.3121652603149414 - - 2.246690034866333 - - -0.1049298644065857 - - 1.5065677165985107 - - 1.6289032697677612 - - 0.7889493107795715 - - 2.104417562484741 - - 0.0004925131797790527 - - 2.5689167976379395 - - 0.9938793778419495 - - 1.2792216539382935 - - 3.282301425933838 - - 0.4691665768623352 - - 2.5706443786621094 - - 2.1773343086242676 - - 2.984010934829712 - - 0.955195426940918 - - 2.2046756744384766 - - 2.579826831817627 - - 1.3101081848144531 - - 1.9379489421844482 - - 0.00035572052001953125 - - 1.0279760360717773 - - 1.931780457496643 - - 2.3884470462799072 - - -2.3005144596099854 - - 0.955195426940918 - - 0.641518235206604 - - 0.5288288593292236 - - 0.09569096565246582 - - 0.0006737411022186279 - - 1.0988528728485107 - - 1.5072537660598755 - - 0.4703049063682556 - - 1.4350025653839111 - - 0.9553766250610352 - - 1.92511785030365 - - 1.4772891998291016 - - 1.7177082300186157 - - 1.3099268674850464 - - 1.0651249885559082 - - 2.691326856613159 - - 1.9249485731124878 - - 2.1015706062316895 - - 0.9936981201171875 - - 1.0651249885559082 - - 0.5901779532432556 - - 0.7409104108810425 - - 0.4704861640930176 - - 2.274041175842285 - - 2.1045303344726562 - - 1.279273509979248 - - -0.10479313135147095 - - 1.195493221282959 - - 2.387086868286133 - - 2.1045584678649902 - - 1.8532490730285645 - - 1.5822457075119019 - - 1.8085259199142456 - - 0.18271362781524658 - - 2.1773622035980225 - - 2.1765642166137695 - - 1.931780457496643 - - 0.8739264011383057 - - 0.5288288593292236 - - 1.0651249885559082 - - 1.8844523429870605 - - 0.5901779532432556 - - 1.54994797706604 - - 1.2188875675201416 - - 3.0550806522369385 - - 2.2266292572021484 - - 0.0004925131797790527 - - 1.617445707321167 - - 1.6289032697677612 - - 2.0370430946350098 - - 1.70457124710083 - - 1.310056209564209 - - 1.6176269054412842 - - 1.5705543756484985 - - 0.4056333899497986 - - 1.2587052583694458 - - 1.4584600925445557 - - 0.9551546573638916 - - 1.5830662250518799 - - 0.40549275279045105 - - 2.1764791011810303 - - 1.5065158605575562 - - 1.522369146347046 - - -0.5095808506011963 - - 1.7772479057312012 - - 1.704390048980713 - - 1.9800255298614502 - - 1.7555005550384521 - - 2.0229907035827637 - - 1.581559658050537 - - 1.931780457496643 - - 1.334634780883789 - - 1.7183942794799805 - - 2.06461238861084 - - 1.0280206203460693 - - 1.2587052583694458 - - 1.4584600925445557 - - 0.3367617130279541 - - 1.6652216911315918 - - -1.6082706451416016 - - 1.195493221282959 - - 1.195493221282959 - - 2.271498203277588 - - 1.4591904878616333 - - 1.8526148796081543 - - 3.4846649169921875 - - 2.5851869583129883 - - 0.8330861330032349 - - 1.7441229820251465 - - 1.9379489421844482 - - 2.0372588634490967 - - 2.2945101261138916 - - 3.7718851566314697 - - 1.6176269054412842 - - 1.6173088550567627 - - 1.279273509979248 - - 1.7448090314865112 - - 1.3863818645477295 - - 1.9258038997650146 - - 2.0701236724853516 - - 0.5286920666694641 - - 1.4115099906921387 - - 0.6415701508522034 - - 0.9554710984230042 - - 2.445803642272949 - - 0.9939219951629639 - - 1.3862005472183228 - - 2.0227746963500977 - - 0.00035956501960754395 - - -0.6920214891433716 - - 0.9554710984230042 - - 1.8080159425735474 - - 0.7409104108810425 - - 1.1315056085586548 - - 1.0992136001586914 - - 
1.717889428138733 - - 1.4355566501617432 - - 2.696624517440796 - - 1.981532096862793 - - 0.8741501569747925 - - 1.506748914718628 - - 0.4704861640930176 - - 2.1661999225616455 - - 1.7444359064102173 - - 2.169395923614502 - - 0.6415701508522034 - - 0.694489061832428 - - -0.10479313135147095 - - 0.7889493107795715 - - 1.065306305885315 - - 1.3862005472183228 - - 1.0653488636016846 - - 1.434821367263794 - - 1.4774258136749268 - - 1.7192147970199585 - - 1.2198303937911987 - - 0.955291748046875 - - 1.0279760360717773 - - 2.140174388885498 - - 1.2198303937911987 - - 1.1955803632736206 - - 2.1685056686401367 - - 1.7568259239196777 - - 1.0280206203460693 - - 1.5715584754943848 - - 2.6298575401306152 - - 2.039620876312256 - - 1.7568777799606323 - - 1.54994797706604 - - 0.8332673907279968 - - 0.917670488357544 - - 1.4118516445159912 - - 1.54994797706604 - - 1.54994797706604 - - 2.38995361328125 - - 2.038114070892334 - - 1.1312817335128784 - - 0.470308780670166 - - 2.806213140487671 - - 1.161426305770874 - - 1.646281123161316 - - 1.6181317567825317 - - 1.8090436458587646 - - 1.3861486911773682 - - 1.744672179222107 - - -0.6920214891433716 - - 0.9936981201171875 - - 1.3099268674850464 - - 3.1659865379333496 - - 1.1315056085586548 - - 1.570378303527832 - - 1.1312817335128784 - - 1.4586412906646729 - - 1.1314630508422852 - - 1.4779438972473145 - - 1.0992136001586914 - - 1.2586534023284912 - - 2.1386680603027344 - - 2.2045223712921143 - - 1.5823750495910645 - - 1.3099268674850464 - - 0.8330342769622803 - - 1.064988374710083 - - -0.10484498739242554 - - 1.5506340265274048 - - 1.3348159790039062 - - 0.8330342769622803 - - 0.6942652463912964 - - 0.9938793778419495 - - 0.6414333581924438 - - 0.9176186323165894 - - 1.4774258136749268 - - 0.9935613870620728 - - 0.18271362781524658 - - 1.2189394235610962 - - 0.9552472829818726 - - 2.246690034866333 - - 0.3367617130279541 - - 1.6289032697677612 - - 1.098937749862671 - - 2.579826831817627 - - 2.7552952766418457 - - 0.6415701508522034 - - 1.3603770732879639 - - 2.0658645629882812 - - 0.9938793778419495 - - 2.444115400314331 - - 1.434821367263794 - - 2.5077669620513916 - - 1.925299048423767 - - 1.9379489421844482 - - 0.0004925131797790527 - - 0.5901779532432556 - - 0.40549275279045105 - - 0.740773618221283 - - 0.0957428514957428 - - 0.0957428514957428 - - 1.0650732517242432 - - 2.777968168258667 - - 0.3366249203681946 - - 0.3366249203681946 - - 0.5288288593292236 - - 0.0004925131797790527 - - 1.0651249885559082 - - -0.510017991065979 - - 0.4703049063682556 - - 1.9742122888565063 - - -0.5102857351303101 - - 2.3232131004333496 - - 1.0991190671920776 - - 2.5077669620513916 - - 1.5223172903060913 - - 1.386063814163208 - - 2.858018159866333 - - 2.3723583221435547 - - 1.8837662935256958 - - 1.9379489421844482 - - 1.6466542482376099 - - 2.5028574466705322 - - 1.6459681987762451 - - 2.2046756744384766 - - 1.780085802078247 - - 1.3867497444152832 - - 0.4703049063682556 - - 3.172811508178711 - - 0.0004925131797790527 - - 0.40562954545021057 - - 0.18271362781524658 - - 1.0651249885559082 - - 3.863699436187744 - - 0.0004925131797790527 - - 2.125455856323242 - - 1.4353705644607544 - - -0.510478138923645 - - 1.92511785030365 - - 2.0260462760925293 - - 2.229368209838867 - - 0.4703049063682556 - - 2.335315465927124 - - 1.386063814163208 - - 2.308807849884033 - - 0.8738744854927063 - - 1.5054293870925903 - - 1.0651249885559082 - - 0.18271362781524658 - - 0.2626492381095886 - - 0.528777003288269 - - 3.2366585731506348 - - -2.297093629837036 - - 2.3712871074676514 - - 
0.8741075992584229 - - 1.3862005472183228 - - 1.9791702032089233 - - 0.7889493107795715 - - -0.5097177028656006 - - 1.7546851634979248 - - 0.7889493107795715 - - 1.5065677165985107 - - 0.917670488357544 - - 1.1314630508422852 - - 1.1312817335128784 - - 1.3863818645477295 - - 2.3886282444000244 - - 1.8766850233078003 - - 1.1312817335128784 - - 1.5230551958084106 - - 0.7889937162399292 - - 0.3367617130279541 - - 2.228400230407715 - - 0.18271362781524658 - - 2.3723583221435547 - - 3.1804051399230957 - - 2.2275471687316895 - - 2.5028574466705322 - - 2.104417562484741 - - 2.38759183883667 - - 1.4591460227966309 - - 2.758937358856201 - - 1.7050760984420776 - - 2.279667854309082 - - 2.104417562484741 - - 0.5288288593292236 - - 0.5289582014083862 - - 1.877371072769165 - - 1.5065677165985107 - - 2.444467067718506 - - 2.3131513595581055 - - 2.1023406982421875 - - 0.8741075992584229 - - 1.434821367263794 - - 0.18271362781524658 - - 0.18271362781524658 - - 1.098989725112915 - - 2.0648281574249268 - - 1.3602402210235596 - - 1.098989725112915 - - 0.5902224183082581 - - 2.2458348274230957 - - -0.3560487627983093 - - 0.18257683515548706 - - 0.7889493107795715 - - 2.5685315132141113 - - 1.1953564882278442 - - 1.4584081172943115 - - 1.334634780883789 - - 1.4350025653839111 - - 0.6941283941268921 - - 0.2626492381095886 - - 0.2626492381095886 - - 2.2458348274230957 - - 0.5901779532432556 - - 2.5037755966186523 - - 1.4767398834228516 - - 1.9379489421844482 - - 0.40562954545021057 - - 0.9552472829818726 - - 2.273859977722168 - - 1.3603770732879639 - - 1.2587052583694458 - - 1.9332870244979858 - - 1.3106811046600342 - - 0.8329493999481201 - - 0.9938793778419495 - - 0.7891731262207031 - - 1.9698771238327026 - - 0.2628304660320282 - - 1.3603770732879639 - - 1.279273509979248 - - 1.4584600925445557 - - 0.528777003288269 - - 1.0652544498443604 - - 2.169395923614502 - - 1.8365846872329712 - - 1.6665914058685303 - - 1.279273509979248 - - 1.7187072038650513 - - 2.3217568397521973 - - 1.7199008464813232 - - 0.26260116696357727 - - 1.4108240604400635 - - 1.279273509979248 - - 1.0281999111175537 - - 0.09574669599533081 - - 1.3603770732879639 - - 2.2046756744384766 - - 2.023075580596924 - - 3.033841371536255 - - 1.8085259199142456 - - 0.788812518119812 - - 1.780085802078247 - - 2.2788124084472656 - - 1.8766850233078003 - - 1.7447571754455566 - - 2.9534451961517334 - - 0.9177149534225464 - - 1.1311450004577637 - - 2.103731632232666 - - 1.5702414512634277 - - 2.1388492584228516 - - 0.5288288593292236 - - 1.8078398704528809 - - 0.1826617419719696 - - 2.440305709838867 - - 1.4766032695770264 - - 1.3099268674850464 - - 2.335315465927124 - - 1.2587052583694458 - - 1.161426305770874 - - 1.3099268674850464 - - 1.0279760360717773 - - 1.4115099906921387 - - 0.5901779532432556 - - 2.9554922580718994 - - 2.2256431579589844 - - 2.444732189178467 - - 2.33109712600708 - - 0.7889493107795715 - - 0.2628304660320282 - - 1.195493221282959 - - 0.7409104108810425 - - 1.4774703979492188 - - 0.8329493999481201 - - 1.7050760984420776 - - 3.229017496109009 - - 1.6466542482376099 - - 0.8739264011383057 - - 1.195493221282959 - - 0.9552472829818726 - - 1.0651249885559082 - - 1.1616075038909912 - - 1.4109089374542236 - - 1.628766417503357 - - 0.4703049063682556 - - 1.5817408561706543 - - -0.1049298644065857 - - -0.5095808506011963 - - 0.9175337553024292 - - 0.8739264011383057 - - 1.54994797706604 - - 2.695953845977783 - - 0.4701681137084961 - - 1.3861486911773682 - - 0.6415701508522034 - - 0.5290101170539856 - - -0.5095808506011963 - - 
-0.6918290853500366 - - -0.5095808506011963 - - 2.1764445304870605 - - 0.5290101170539856 - - 0.40562954545021057 - - 2.3895459175109863 - - 0.4704861640930176 - - 0.18257683515548706 - - 0.0004925131797790527 - - 1.4591460227966309 - - 1.098989725112915 - - 0.6415701508522034 - - 0.6415701508522034 - - 0.9175337553024292 - - 0.5901779532432556 - - -0.10461187362670898 - - 2.4654746055603027 - - 0.641699492931366 - - 1.0641679763793945 - - 1.2791367769241333 - - 1.3099268674850464 - - 1.2791367769241333 - - 1.1312817335128784 - - 1.195493221282959 - - 0.5903592109680176 - - 1.2588346004486084 - - 3.4728047847747803 - - 0.788812518119812 - - -0.10479313135147095 - - 0.4703049063682556 - - 1.9800255298614502 - - 0.40581077337265015 - - 0.33694297075271606 - - 0.4703049063682556 - - 1.6289032697677612 - - 0.8737895488739014 - - 0.917670488357544 - - 1.704390048980713 - - 0.18271362781524658 - - 0.4055776596069336 - - 1.9800255298614502 - - 0.18271362781524658 - - 1.2189394235610962 - - 1.195493221282959 - - 0.4703049063682556 - - 1.3106127977371216 - - -0.10479313135147095 - - 0.4056739807128906 - - 1.0281054973602295 - - 1.2188875675201416 - - 0.0004406273365020752 - - 0.7408585548400879 - - 0.6942652463912964 - - 0.0004925131797790527 - - 1.7049392461776733 - - 0.4704861640930176 - - 0.6415701508522034 - - 0.0004925131797790527 - - 1.2189394235610962 - - 0.5901779532432556 - - 1.161426305770874 - - -0.22207701206207275 - - 1.4769212007522583 - - 0.6415701508522034 - - 0.8330861330032349 - - 0.917670488357544 - - 1.0281572341918945 - - 0.6415701508522034 - - -1.201601266860962 - - 0.8330861330032349 - - 1.5506340265274048 - - 0.7889493107795715 - - 0.7408585548400879 - - 1.8775522708892822 - - 1.1312817335128784 - - 0.7409104108810425 - - 0.0004925131797790527 - - 1.2189394235610962 - - 0.6415701508522034 - - 0.6417514085769653 - - 0.8332673907279968 - - 1.4767398834228516 - - 2.0265510082244873 - - 1.877371072769165 - - 2.125274658203125 - - 0.7889493107795715 - - 1.2189394235610962 - - 0.3367617130279541 - - 1.6289032697677612 - - 0.0957428514957428 - - 1.9740430116653442 - - 1.755319356918335 - - 2.3223230838775635 - - 0.9936981201171875 - - 0.4703049063682556 - - 1.6295374631881714 - - 2.023845911026001 - - 0.9936055541038513 - - 0.6941283941268921 - - 0.8330861330032349 - - 1.6289032697677612 - - 2.0176072120666504 - - 1.334634780883789 - - 1.098989725112915 - - 1.5072537660598755 - - 2.1386680603027344 - - 1.6461493968963623 - - 2.1685404777526855 - - 2.3732762336730957 - - 2.1013545989990234 - - 1.5230551958084106 - - 0.917670488357544 - - 0.4703049063682556 - - 1.931780457496643 - - 0.788812518119812 - - 1.8085259199142456 - - 1.098989725112915 - - 1.925129771232605 - - 1.4109606742858887 - - 1.7906303405761719 - - 2.2039055824279785 - - 0.18271362781524658 - - 1.161426305770874 - - 2.4505856037139893 - - 2.273859977722168 - - 1.0991709232330322 - - -0.22225826978683472 - - 1.5703264474868774 - - 1.5823750495910645 - - -0.6920214891433716 - - 2.241225481033325 - - 0.5901779532432556 - - 0.0006737411022186279 - - 2.3319525718688965 - - 2.0601859092712402 - - 0.8330342769622803 - - 1.8844523429870605 - - 2.5075857639312744 - - 1.5506340265274048 - - 1.8335193395614624 - - 1.0650732517242432 - - 0.6941283941268921 - - 0.2626492381095886 - - 0.917670488357544 - - 0.0957428514957428 - - 0.2628304660320282 - - 0.5288288593292236 - - -0.10479313135147095 - - 0.5901779532432556 - - 1.5703264474868774 - - 0.5901779532432556 - - 1.2189394235610962 - - -0.10479313135147095 - - 
1.6957303285598755 - - 0.6941283941268921 - - 1.8844523429870605 - - 1.3612442016601562 - - 1.7901780605316162 - - 0.9552472829818726 - - 2.382542371749878 - - 0.788812518119812 - - 1.5710643529891968 - - 1.3344979286193848 - - 2.597963333129883 - - 1.0991709232330322 - - 1.4776071310043335 - - 0.470308780670166 - - 0.3367617130279541 - - 1.8847652673721313 - - 3.0270333290100098 - - 1.806637167930603 - - 2.631330728530884 - - 2.3328704833984375 - - 1.7555524110794067 - - 2.2414751052856445 - - 1.2587497234344482 - - 1.434821367263794 - - 1.9791702032089233 - - 1.5702414512634277 - - 0.6414333581924438 - - 1.5710643529891968 - - 2.3328704833984375 - - 2.445803642272949 - - 2.038114070892334 - - 2.4652934074401855 - - -0.5107458829879761 - - 1.696232557296753 - - 1.161426305770874 - - 0.788812518119812 - - 1.6459681987762451 - - 0.8330861330032349 - - 0.8738744854927063 - - 2.772785186767578 - - 1.5222322940826416 - - 1.6297705173492432 - - 1.334582805633545 - - 1.0988528728485107 -schema: https://metadata.datadrivendiscovery.org/schemas/v0/pipeline_run.json -start: '2019-06-19T09:54:48.783596Z' -status: - state: SUCCESS -steps: -- end: '2019-06-19T09:54:48.794840Z' - hyperparams: - dataframe_resource: - data: null - type: VALUE - method_calls: - - end: '2019-06-19T09:54:48.785537Z' - logging: [] - name: __init__ - start: '2019-06-19T09:54:48.785486Z' - status: - state: SUCCESS - - end: '2019-06-19T09:54:48.790607Z' - logging: [] - metadata: - produce: - - metadata: - dimension: - length: 736 - name: rows - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularRow - schema: https://metadata.datadrivendiscovery.org/schemas/v0/container.json - semantic_types: - - https://metadata.datadrivendiscovery.org/types/Table - structural_type: d3m.container.pandas.DataFrame - selector: [] - - metadata: - dimension: - length: 30 - name: columns - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularColumn - selector: - - __ALL_ELEMENTS__ - - metadata: - name: d3mIndex - semantic_types: - - http://schema.org/Integer - - https://metadata.datadrivendiscovery.org/types/PrimaryKey - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 0 - - metadata: - name: idnum - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/UniqueKey - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 1 - - metadata: - name: state - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 2 - - metadata: - name: state2 - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 3 - - metadata: - name: stfips - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 4 - - metadata: - name: zip - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 5 - - metadata: - name: region - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 6 - - metadata: 
- name: typebldg - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 7 - - metadata: - name: floor - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 8 - - metadata: - name: room - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 9 - - metadata: - name: basement - semantic_types: - - http://schema.org/Boolean - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 10 - - metadata: - name: windoor - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 11 - - metadata: - name: rep - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 12 - - metadata: - name: stratum - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 13 - - metadata: - name: wave - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 14 - - metadata: - name: starttm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 15 - - metadata: - name: stoptm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 16 - - metadata: - name: startdt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 17 - - metadata: - name: stopdt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 18 - - metadata: - name: activity - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 19 - - metadata: - name: pcterr - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 20 - - metadata: - name: adjwt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 21 - - metadata: - name: dupflag - semantic_types: - - http://schema.org/Boolean - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 22 - - metadata: - name: zipflag - semantic_types: - - http://schema.org/Boolean - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - 
__ALL_ELEMENTS__ - - 23 - - metadata: - name: cntyfips - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 24 - - metadata: - name: county - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 25 - - metadata: - name: fips - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 26 - - metadata: - name: Uppm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 27 - - metadata: - name: county_code - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 28 - - metadata: - name: log_radon - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/SuggestedTarget - - https://metadata.datadrivendiscovery.org/types/Target - - https://metadata.datadrivendiscovery.org/types/TrueTarget - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 29 - name: fit_multi_produce - start: '2019-06-19T09:54:48.786064Z' - status: - state: SUCCESS - random_seed: 0 - start: '2019-06-19T09:54:48.783616Z' - status: - state: SUCCESS - type: PRIMITIVE -- end: '2019-06-19T09:54:48.892999Z' - hyperparams: - add_index_columns: - data: true - type: VALUE - exclude_columns: - data: [] - type: VALUE - parse_categorical_target_columns: - data: false - type: VALUE - replace_index_columns: - data: true - type: VALUE - return_result: - data: replace - type: VALUE - use_columns: - data: [] - type: VALUE - method_calls: - - end: '2019-06-19T09:54:48.797730Z' - logging: [] - name: __init__ - start: '2019-06-19T09:54:48.797694Z' - status: - state: SUCCESS - - end: '2019-06-19T09:54:48.888883Z' - logging: [] - metadata: - produce: - - metadata: - dimension: - length: 736 - name: rows - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularRow - schema: https://metadata.datadrivendiscovery.org/schemas/v0/container.json - semantic_types: - - https://metadata.datadrivendiscovery.org/types/Table - structural_type: d3m.container.pandas.DataFrame - selector: [] - - metadata: - dimension: - length: 30 - name: columns - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularColumn - selector: - - __ALL_ELEMENTS__ - - metadata: - name: d3mIndex - semantic_types: - - http://schema.org/Integer - - https://metadata.datadrivendiscovery.org/types/PrimaryKey - structural_type: int - selector: - - __ALL_ELEMENTS__ - - 0 - - metadata: - name: idnum - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/UniqueKey - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 1 - - metadata: - name: state - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 2 - - metadata: - name: state2 - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - 
https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 3 - - metadata: - name: stfips - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 4 - - metadata: - name: zip - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 5 - - metadata: - name: region - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 6 - - metadata: - name: typebldg - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 7 - - metadata: - name: floor - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 8 - - metadata: - name: room - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 9 - - metadata: - name: basement - semantic_types: - - http://schema.org/Boolean - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: int - selector: - - __ALL_ELEMENTS__ - - 10 - - metadata: - name: windoor - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 11 - - metadata: - name: rep - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 12 - - metadata: - name: stratum - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 13 - - metadata: - name: wave - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 14 - - metadata: - name: starttm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 15 - - metadata: - name: stoptm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 16 - - metadata: - name: startdt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 17 - - metadata: - name: stopdt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 18 - - metadata: - name: activity - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - 
selector: - - __ALL_ELEMENTS__ - - 19 - - metadata: - name: pcterr - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 20 - - metadata: - name: adjwt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 21 - - metadata: - name: dupflag - semantic_types: - - http://schema.org/Boolean - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: int - selector: - - __ALL_ELEMENTS__ - - 22 - - metadata: - name: zipflag - semantic_types: - - http://schema.org/Boolean - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: int - selector: - - __ALL_ELEMENTS__ - - 23 - - metadata: - name: cntyfips - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 24 - - metadata: - name: county - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 25 - - metadata: - name: fips - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 26 - - metadata: - name: Uppm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 27 - - metadata: - name: county_code - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 28 - - metadata: - name: log_radon - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/SuggestedTarget - - https://metadata.datadrivendiscovery.org/types/Target - - https://metadata.datadrivendiscovery.org/types/TrueTarget - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 29 - name: fit_multi_produce - start: '2019-06-19T09:54:48.798357Z' - status: - state: SUCCESS - random_seed: 1 - start: '2019-06-19T09:54:48.794860Z' - status: - state: SUCCESS - type: PRIMITIVE -- end: '2019-06-19T09:54:48.901575Z' - hyperparams: - add_index_columns: - data: false - type: VALUE - exclude_columns: - data: [] - type: VALUE - match_logic: - data: any - type: VALUE - negate: - data: false - type: VALUE - use_columns: - data: [] - type: VALUE - method_calls: - - end: '2019-06-19T09:54:48.895747Z' - logging: [] - name: __init__ - start: '2019-06-19T09:54:48.895710Z' - status: - state: SUCCESS - - end: '2019-06-19T09:54:48.899643Z' - logging: [] - metadata: - produce: - - metadata: - dimension: - length: 736 - name: rows - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularRow - schema: https://metadata.datadrivendiscovery.org/schemas/v0/container.json - semantic_types: - - https://metadata.datadrivendiscovery.org/types/Table - structural_type: d3m.container.pandas.DataFrame - selector: [] - - metadata: - dimension: - length: 12 - name: columns - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularColumn - selector: - - __ALL_ELEMENTS__ - - metadata: - name: state - semantic_types: - - 
https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 0 - - metadata: - name: state2 - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 1 - - metadata: - name: zip - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 2 - - metadata: - name: region - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 3 - - metadata: - name: typebldg - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 4 - - metadata: - name: floor - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 5 - - metadata: - name: windoor - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 6 - - metadata: - name: rep - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 7 - - metadata: - name: stratum - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 8 - - metadata: - name: county - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 9 - - metadata: - name: fips - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 10 - - metadata: - name: county_code - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 11 - name: fit_multi_produce - start: '2019-06-19T09:54:48.896349Z' - status: - state: SUCCESS - random_seed: 2 - start: '2019-06-19T09:54:48.893020Z' - status: - state: SUCCESS - type: PRIMITIVE -- end: '2019-06-19T09:54:48.909559Z' - hyperparams: - add_index_columns: - data: false - type: VALUE - match_logic: - data: any - type: VALUE - negate: - data: false - type: VALUE - use_columns: - data: [] - type: VALUE - method_calls: - - end: '2019-06-19T09:54:48.904245Z' - logging: [] - name: __init__ - start: '2019-06-19T09:54:48.904212Z' - status: - state: SUCCESS - - end: '2019-06-19T09:54:48.907449Z' - logging: [] - metadata: - produce: - - metadata: - dimension: - length: 736 - name: rows - semantic_types: - - 
https://metadata.datadrivendiscovery.org/types/TabularRow - schema: https://metadata.datadrivendiscovery.org/schemas/v0/container.json - semantic_types: - - https://metadata.datadrivendiscovery.org/types/Table - structural_type: d3m.container.pandas.DataFrame - selector: [] - - metadata: - dimension: - length: 14 - name: columns - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularColumn - selector: - - __ALL_ELEMENTS__ - - metadata: - name: idnum - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/UniqueKey - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 0 - - metadata: - name: stfips - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 1 - - metadata: - name: room - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 2 - - metadata: - name: wave - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 3 - - metadata: - name: starttm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 4 - - metadata: - name: stoptm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 5 - - metadata: - name: startdt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 6 - - metadata: - name: stopdt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 7 - - metadata: - name: activity - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 8 - - metadata: - name: pcterr - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 9 - - metadata: - name: adjwt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 10 - - metadata: - name: cntyfips - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 11 - - metadata: - name: Uppm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 12 - - metadata: - name: log_radon - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/SuggestedTarget - - https://metadata.datadrivendiscovery.org/types/Target - - https://metadata.datadrivendiscovery.org/types/TrueTarget - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 13 - name: fit_multi_produce - start: '2019-06-19T09:54:48.904812Z' - status: - state: SUCCESS - random_seed: 3 - start: '2019-06-19T09:54:48.901595Z' - status: - state: SUCCESS - 
type: PRIMITIVE -- end: '2019-06-19T09:54:48.915759Z' - hyperparams: - add_index_columns: - data: false - type: VALUE - exclude_columns: - data: [] - type: VALUE - match_logic: - data: any - type: VALUE - negate: - data: false - type: VALUE - use_columns: - data: [] - type: VALUE - method_calls: - - end: '2019-06-19T09:54:48.912059Z' - logging: [] - name: __init__ - start: '2019-06-19T09:54:48.912027Z' - status: - state: SUCCESS - - end: '2019-06-19T09:54:48.915161Z' - logging: [] - metadata: - produce: - - metadata: - dimension: - length: 736 - name: rows - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularRow - schema: https://metadata.datadrivendiscovery.org/schemas/v0/container.json - semantic_types: - - https://metadata.datadrivendiscovery.org/types/Table - structural_type: d3m.container.pandas.DataFrame - selector: [] - - metadata: - dimension: - length: 1 - name: columns - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularColumn - selector: - - __ALL_ELEMENTS__ - - metadata: - name: log_radon - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/SuggestedTarget - - https://metadata.datadrivendiscovery.org/types/Target - - https://metadata.datadrivendiscovery.org/types/TrueTarget - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 0 - name: fit_multi_produce - start: '2019-06-19T09:54:48.912622Z' - status: - state: SUCCESS - random_seed: 4 - start: '2019-06-19T09:54:48.909578Z' - status: - state: SUCCESS - type: PRIMITIVE -- end: '2019-06-19T09:54:48.966235Z' - hyperparams: - add_index_columns: - data: false - type: VALUE - error_on_no_input: - data: true - type: VALUE - exclude_columns: - data: [] - type: VALUE - fill_value: - data: - case: none - value: null - type: VALUE - missing_values: - data: - case: float - value: - encoding: pickle - value: gANHf/gAAAAAAAAu - type: VALUE - return_semantic_type: - data: https://metadata.datadrivendiscovery.org/types/Attribute - type: VALUE - strategy: - data: mean - type: VALUE - use_columns: - data: [] - type: VALUE - method_calls: - - end: '2019-06-19T09:54:48.919732Z' - logging: [] - name: __init__ - start: '2019-06-19T09:54:48.919684Z' - status: - state: SUCCESS - - end: '2019-06-19T09:54:48.964102Z' - logging: [] - metadata: - produce: - - metadata: - dimension: - length: 736 - name: rows - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularRow - schema: https://metadata.datadrivendiscovery.org/schemas/v0/container.json - semantic_types: - - https://metadata.datadrivendiscovery.org/types/Table - structural_type: d3m.container.pandas.DataFrame - selector: [] - - metadata: - dimension: - length: 14 - name: columns - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularColumn - selector: - - __ALL_ELEMENTS__ - - metadata: - name: idnum - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/UniqueKey - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 0 - - metadata: - name: stfips - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 1 - - metadata: - name: room - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 2 - - metadata: - name: wave - semantic_types: - - http://schema.org/Float - - 
https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 3 - - metadata: - name: starttm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 4 - - metadata: - name: stoptm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 5 - - metadata: - name: startdt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 6 - - metadata: - name: stopdt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 7 - - metadata: - name: activity - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 8 - - metadata: - name: pcterr - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 9 - - metadata: - name: adjwt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 10 - - metadata: - name: cntyfips - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 11 - - metadata: - name: Uppm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 12 - - metadata: - name: log_radon - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/SuggestedTarget - - https://metadata.datadrivendiscovery.org/types/Target - - https://metadata.datadrivendiscovery.org/types/TrueTarget - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 13 - name: fit_multi_produce - start: '2019-06-19T09:54:48.920417Z' - status: - state: SUCCESS - random_seed: 5 - start: '2019-06-19T09:54:48.915777Z' - status: - state: SUCCESS - type: PRIMITIVE -- end: '2019-06-19T09:54:49.039949Z' - hyperparams: - add_index_columns: - data: true - type: VALUE - base_score: - data: 0.5 - type: VALUE - colsample_bylevel: - data: 1 - type: VALUE - colsample_bytree: - data: 1 - type: VALUE - exclude_inputs_columns: - data: [] - type: VALUE - exclude_outputs_columns: - data: [] - type: VALUE - gamma: - data: 0.0 - type: VALUE - importance_type: - data: gain - type: VALUE - learning_rate: - data: 0.1 - type: VALUE - max_delta_step: - data: - case: unlimited - value: 0 - type: VALUE - max_depth: - data: - case: limit - value: 3 - type: VALUE - min_child_weight: - data: 1 - type: VALUE - n_estimators: - data: 100 - type: VALUE - n_jobs: - data: - case: limit - value: 1 - type: VALUE - n_more_estimators: - data: 100 - type: VALUE - reg_alpha: - data: 0 - type: VALUE - reg_lambda: - data: 1 - type: VALUE - scale_pos_weight: - data: 1 - type: VALUE - subsample: - data: 1 - type: VALUE - use_inputs_columns: - data: [] - type: VALUE - use_outputs_columns: - data: [] - type: VALUE - method_calls: - - end: '2019-06-19T09:54:48.972495Z' - logging: [] - name: 
__init__ - start: '2019-06-19T09:54:48.972423Z' - status: - state: SUCCESS - - end: '2019-06-19T09:54:49.039069Z' - logging: [] - metadata: - produce: - - metadata: - dimension: - length: 736 - name: rows - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularRow - schema: https://metadata.datadrivendiscovery.org/schemas/v0/container.json - semantic_types: - - https://metadata.datadrivendiscovery.org/types/Table - structural_type: d3m.container.pandas.DataFrame - selector: [] - - metadata: - dimension: - length: 3 - name: columns - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularColumn - selector: - - __ALL_ELEMENTS__ - - metadata: - name: idnum - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/UniqueKey - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 0 - - metadata: - name: log_radon - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/SuggestedTarget - - https://metadata.datadrivendiscovery.org/types/Target - - https://metadata.datadrivendiscovery.org/types/PredictedTarget - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 1 - - metadata: - name: log_radon - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/SuggestedTarget - - https://metadata.datadrivendiscovery.org/types/Target - - https://metadata.datadrivendiscovery.org/types/TrueTarget - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 2 - name: fit_multi_produce - start: '2019-06-19T09:54:48.973599Z' - status: - state: SUCCESS - random_seed: 6 - start: '2019-06-19T09:54:48.966255Z' - status: - state: SUCCESS - type: PRIMITIVE -- end: '2019-06-19T09:54:49.049637Z' - hyperparams: - exclude_columns: - data: [] - type: VALUE - use_columns: - data: [] - type: VALUE - method_calls: - - end: '2019-06-19T09:54:49.041989Z' - logging: [] - name: __init__ - start: '2019-06-19T09:54:49.041954Z' - status: - state: SUCCESS - - end: '2019-06-19T09:54:49.048870Z' - logging: [] - metadata: - produce: - - metadata: - dimension: - length: 736 - name: rows - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularRow - schema: https://metadata.datadrivendiscovery.org/schemas/v0/container.json - semantic_types: - - https://metadata.datadrivendiscovery.org/types/Table - structural_type: d3m.container.pandas.DataFrame - selector: [] - - metadata: - dimension: - length: 2 - name: columns - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularColumn - selector: - - __ALL_ELEMENTS__ - - metadata: - name: d3mIndex - semantic_types: - - http://schema.org/Integer - - https://metadata.datadrivendiscovery.org/types/PrimaryKey - structural_type: int - selector: - - __ALL_ELEMENTS__ - - 0 - - metadata: - name: log_radon - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/SuggestedTarget - - https://metadata.datadrivendiscovery.org/types/Target - - https://metadata.datadrivendiscovery.org/types/PredictedTarget - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 1 - name: fit_multi_produce - start: '2019-06-19T09:54:49.042531Z' - status: - state: SUCCESS - random_seed: 7 - start: '2019-06-19T09:54:49.039969Z' - status: - state: SUCCESS - type: PRIMITIVE ---- -context: TESTING -datasets: -- digest: d2413f26c84df994b808f397c3aa9908e54169d43ca3034ea1a2c3f9c9a6ec27 - id: 26_radon_seed_dataset_TEST -end: '2019-06-19T09:54:49.273243Z' -environment: - engine_version: 
2019.6.7 - id: 6fdad0c4-dcb1-541d-a2a7-d2d9590c26dd - reference_engine_version: 2019.6.7 - resources: - cpu: - constraints: - cpu_shares: 1024 - devices: - - name: Intel Core Processor (Broadwell) - - name: Intel Core Processor (Broadwell) - - name: Intel Core Processor (Broadwell) - - name: Intel Core Processor (Broadwell) - - name: Intel Core Processor (Broadwell) - - name: Intel Core Processor (Broadwell) - - name: Intel Core Processor (Broadwell) - - name: Intel Core Processor (Broadwell) - logical_present: 8 - physical_present: 8 - memory: - total_memory: 25281880064 - worker_id: f5ffb488-a883-509d-ad4f-5e819913cb14 -id: 93ad8197-9c10-5666-b50c-dcc06d46f9d2 -pipeline: - digest: 79e82bee63a0db3dd75d41de0caa2054b860a8211fa9895a061fdb8f9f1c444d - id: 0f636602-6299-411b-9873-4b974cd393ba -previous_pipeline_run: - id: 67cc5311-80d6-5f1f-96f6-7f7b9d1308a0 -problem: - digest: 01ab113ff802b57fe872f7b4e4422789921d033c41bc8ad0bd6e0d041291ed6f - id: 26_radon_seed_problem -random_seed: 0 -run: - phase: PRODUCE - results: - predictions: - header: - - d3mIndex - - log_radon - values: - - - 5 - - 23 - - 30 - - 31 - - 33 - - 39 - - 44 - - 49 - - 54 - - 60 - - 63 - - 65 - - 66 - - 67 - - 70 - - 72 - - 76 - - 77 - - 78 - - 82 - - 86 - - 96 - - 97 - - 107 - - 109 - - 110 - - 120 - - 136 - - 137 - - 139 - - 141 - - 165 - - 168 - - 174 - - 192 - - 198 - - 208 - - 209 - - 210 - - 213 - - 215 - - 218 - - 227 - - 231 - - 235 - - 239 - - 244 - - 250 - - 254 - - 259 - - 261 - - 265 - - 266 - - 275 - - 281 - - 286 - - 290 - - 296 - - 299 - - 300 - - 305 - - 306 - - 309 - - 311 - - 314 - - 321 - - 323 - - 326 - - 331 - - 332 - - 333 - - 334 - - 344 - - 350 - - 351 - - 357 - - 359 - - 360 - - 365 - - 377 - - 380 - - 381 - - 394 - - 408 - - 416 - - 417 - - 429 - - 430 - - 433 - - 436 - - 439 - - 440 - - 448 - - 464 - - 468 - - 483 - - 486 - - 490 - - 493 - - 497 - - 501 - - 514 - - 523 - - 526 - - 527 - - 530 - - 533 - - 536 - - 539 - - 542 - - 545 - - 549 - - 551 - - 552 - - 554 - - 558 - - 584 - - 585 - - 591 - - 594 - - 595 - - 597 - - 598 - - 599 - - 601 - - 611 - - 616 - - 617 - - 633 - - 635 - - 638 - - 655 - - 656 - - 666 - - 668 - - 669 - - 680 - - 685 - - 687 - - 688 - - 703 - - 704 - - 706 - - 708 - - 710 - - 714 - - 715 - - 723 - - 734 - - 739 - - 745 - - 753 - - 755 - - 759 - - 760 - - 767 - - 788 - - 790 - - 793 - - 794 - - 795 - - 798 - - 808 - - 809 - - 814 - - 817 - - 823 - - 827 - - 834 - - 848 - - 862 - - 864 - - 869 - - 872 - - 879 - - 880 - - 882 - - 893 - - 897 - - 904 - - 908 - - 913 - - 914 - - - 0.9552472829818726 - - 0.6942652463912964 - - 1.5232363939285278 - - 1.506748914718628 - - 2.103731632232666 - - 1.0988528728485107 - - -1.2187950611114502 - - 0.5904017686843872 - - 0.6942652463912964 - - 1.54994797706604 - - -0.6920214891433716 - - 1.5065677165985107 - - 1.8838632106781006 - - 1.0279760360717773 - - 1.9791702032089233 - - 0.992601752281189 - - 1.9309251308441162 - - 2.569783926010132 - - 1.7802670001983643 - - 2.6912167072296143 - - 2.264638900756836 - - 1.3862450122833252 - - 0.3367617130279541 - - 1.334634780883789 - - 0.694213330745697 - - 1.6957303285598755 - - 0.5903072953224182 - - 1.5814228057861328 - - 1.2586534023284912 - - 1.2587052583694458 - - 0.40562954545021057 - - 1.5043662786483765 - - 0.7409104108810425 - - 1.6481608152389526 - - 1.5065677165985107 - - 0.18271362781524658 - - 0.40562954545021057 - - 0.4055776596069336 - - 0.694489061832428 - - 1.3603770732879639 - - 1.4774258136749268 - - 0.8330861330032349 - - 1.8773192167282104 - - 1.5065158605575562 - - 
0.8738744854927063 - - 0.9552472829818726 - - 2.2046756744384766 - - 2.690866708755493 - - 0.9936981201171875 - - 1.5821088552474976 - - 1.279273509979248 - - 1.2194933891296387 - - 0.7889493107795715 - - 0.33671367168426514 - - 1.7054443359375 - - 1.3873944282531738 - - 1.0651249885559082 - - 1.3612680435180664 - - 1.7199008464813232 - - 0.9554710984230042 - - 1.4770009517669678 - - 1.5707955360412598 - - 0.528777003288269 - - -0.22207701206207275 - - 1.718523383140564 - - 0.5902649760246277 - - 2.5712332725524902 - - 1.744625210762024 - - 2.038295269012451 - - 0.9936981201171875 - - 1.5226820707321167 - - 1.7928229570388794 - - 0.5289158821105957 - - 0.0004963576793670654 - - 0.641793966293335 - - 1.833651065826416 - - 1.3862005472183228 - - 1.0990341901779175 - - 1.3603770732879639 - - 0.4703049063682556 - - 1.3099268674850464 - - 1.1312817335128784 - - 2.1388492584228516 - - 1.522550344467163 - - 0.3367098271846771 - - 2.444817304611206 - - 1.4766032695770264 - - 1.2190687656402588 - - 1.4584081172943115 - - 1.2189394235610962 - - 2.064997434616089 - - 1.279273509979248 - - 1.550502061843872 - - -0.5095808506011963 - - 0.6415701508522034 - - 1.195493221282959 - - 0.40562954545021057 - - 1.6176269054412842 - - 1.0652544498443604 - - 0.7409104108810425 - - 2.1023406982421875 - - 1.8332862854003906 - - 1.522369146347046 - - 1.195493221282959 - - 1.6297705173492432 - - 0.740773618221283 - - 0.7889493107795715 - - 0.9552472829818726 - - 0.9552472829818726 - - 1.0279760360717773 - - 2.5039567947387695 - - 1.3602402210235596 - - 1.7800339460372925 - - -0.6920214891433716 - - 1.0651249885559082 - - 0.470253050327301 - - 1.0281999111175537 - - 0.2628304660320282 - - 1.6176269054412842 - - 0.9552472829818726 - - 0.2625162601470947 - - 0.5903499126434326 - - 1.1616501808166504 - - -0.22225826978683472 - - 0.6941283941268921 - - 1.54994797706604 - - 1.6466542482376099 - - 2.062042713165283 - - 0.2626492381095886 - - 1.4584600925445557 - - 0.7408585548400879 - - 0.5288288593292236 - - 1.570378303527832 - - 2.3885598182678223 - - 2.169395923614502 - - 1.5230551958084106 - - 2.1773343086242676 - - -0.22239500284194946 - - 1.2587052583694458 - - 0.7889493107795715 - - 1.1616075038909912 - - 1.2189394235610962 - - 1.7447571754455566 - - 0.4701681137084961 - - 0.18271362781524658 - - 0.3367617130279541 - - 1.161426305770874 - - 0.26259732246398926 - - 0.5288288593292236 - - -0.3560487627983093 - - 1.1612895727157593 - - 0.40562954545021057 - - 0.4703049063682556 - - 0.5901779532432556 - - 0.18257683515548706 - - -0.22239500284194946 - - 1.8846335411071777 - - 1.2188026905059814 - - 2.024027109146118 - - 2.6903984546661377 - - 0.6415701508522034 - - 1.334582805633545 - - 1.3101081848144531 - - 0.4703049063682556 - - 1.1312817335128784 - - 1.581559658050537 - - 2.9553821086883545 - - 2.1388492584228516 - - 1.1954413652420044 - - 1.8844523429870605 - - 2.2962329387664795 - - 2.1386680603027344 - - 1.617445707321167 - - 0.917670488357544 - - 1.360558271408081 - - 0.6415701508522034 - - 0.6415701508522034 - - 2.464556932449341 - - -0.22225826978683472 - - 1.92511785030365 - - 2.023845911026001 - - 2.2644577026367188 - - 1.877371072769165 - scores: - - metric: - metric: ROOT_MEAN_SQUARED_ERROR - normalized: 0.9999912641893107 - value: 0.017471621378821623 - scoring: - datasets: - - digest: d2413f26c84df994b808f397c3aa9908e54169d43ca3034ea1a2c3f9c9a6ec27 - id: 26_radon_seed_dataset_SCORE - end: '2019-06-19T09:54:49.314593Z' - pipeline: - digest: a7a07527cdff5a525341894356056b4420d9b99f12bc1a90198880a3ea7f6bd1 
- id: f596cd77-25f8-4d4c-a350-bb30ab1e58f6 - random_seed: 0 - start: '2019-06-19T09:54:49.288788Z' - status: - state: SUCCESS - steps: - - end: '2019-06-19T09:54:49.314573Z' - hyperparams: - add_normalized_scores: - data: true - type: VALUE - metrics: - data: - - k: null - metric: ROOT_MEAN_SQUARED_ERROR - pos_label: null - type: VALUE - method_calls: - - end: '2019-06-19T09:54:49.291387Z' - logging: [] - name: __init__ - start: '2019-06-19T09:54:49.291349Z' - status: - state: SUCCESS - - end: '2019-06-19T09:54:49.313706Z' - logging: [] - metadata: - produce: - - metadata: - dimension: - length: 1 - name: rows - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularRow - schema: https://metadata.datadrivendiscovery.org/schemas/v0/container.json - semantic_types: - - https://metadata.datadrivendiscovery.org/types/Table - structural_type: d3m.container.pandas.DataFrame - selector: [] - - metadata: - dimension: - length: 3 - name: columns - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularColumn - selector: - - __ALL_ELEMENTS__ - - metadata: - name: metric - semantic_types: - - https://metadata.datadrivendiscovery.org/types/PrimaryMultiKey - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 0 - - metadata: - name: value - semantic_types: - - https://metadata.datadrivendiscovery.org/types/Score - structural_type: numpy.float64 - selector: - - __ALL_ELEMENTS__ - - 1 - - metadata: - name: normalized - semantic_types: - - https://metadata.datadrivendiscovery.org/types/Score - structural_type: numpy.float64 - selector: - - __ALL_ELEMENTS__ - - 2 - name: fit_multi_produce - start: '2019-06-19T09:54:49.291923Z' - status: - state: SUCCESS - random_seed: 0 - start: '2019-06-19T09:54:49.288803Z' - status: - state: SUCCESS - type: PRIMITIVE -schema: https://metadata.datadrivendiscovery.org/schemas/v0/pipeline_run.json -start: '2019-06-19T09:54:49.108507Z' -status: - state: SUCCESS -steps: -- end: '2019-06-19T09:54:49.117083Z' - method_calls: - - end: '2019-06-19T09:54:49.113071Z' - logging: [] - metadata: - produce: - - metadata: - dimension: - length: 183 - name: rows - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularRow - schema: https://metadata.datadrivendiscovery.org/schemas/v0/container.json - semantic_types: - - https://metadata.datadrivendiscovery.org/types/Table - structural_type: d3m.container.pandas.DataFrame - selector: [] - - metadata: - dimension: - length: 30 - name: columns - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularColumn - selector: - - __ALL_ELEMENTS__ - - metadata: - name: d3mIndex - semantic_types: - - http://schema.org/Integer - - https://metadata.datadrivendiscovery.org/types/PrimaryKey - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 0 - - metadata: - name: idnum - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/UniqueKey - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 1 - - metadata: - name: state - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 2 - - metadata: - name: state2 - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 3 - - metadata: - name: stfips - semantic_types: - - 
http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 4 - - metadata: - name: zip - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 5 - - metadata: - name: region - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 6 - - metadata: - name: typebldg - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 7 - - metadata: - name: floor - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 8 - - metadata: - name: room - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 9 - - metadata: - name: basement - semantic_types: - - http://schema.org/Boolean - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 10 - - metadata: - name: windoor - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 11 - - metadata: - name: rep - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 12 - - metadata: - name: stratum - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 13 - - metadata: - name: wave - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 14 - - metadata: - name: starttm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 15 - - metadata: - name: stoptm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 16 - - metadata: - name: startdt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 17 - - metadata: - name: stopdt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 18 - - metadata: - name: activity - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 19 - - metadata: - name: pcterr - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - 
structural_type: str - selector: - - __ALL_ELEMENTS__ - - 20 - - metadata: - name: adjwt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 21 - - metadata: - name: dupflag - semantic_types: - - http://schema.org/Boolean - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 22 - - metadata: - name: zipflag - semantic_types: - - http://schema.org/Boolean - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 23 - - metadata: - name: cntyfips - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 24 - - metadata: - name: county - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 25 - - metadata: - name: fips - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 26 - - metadata: - name: Uppm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 27 - - metadata: - name: county_code - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 28 - - metadata: - name: log_radon - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/SuggestedTarget - - https://metadata.datadrivendiscovery.org/types/Target - - https://metadata.datadrivendiscovery.org/types/TrueTarget - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 29 - name: multi_produce - start: '2019-06-19T09:54:49.109030Z' - status: - state: SUCCESS - start: '2019-06-19T09:54:49.108527Z' - status: - state: SUCCESS - type: PRIMITIVE -- end: '2019-06-19T09:54:49.204005Z' - method_calls: - - end: '2019-06-19T09:54:49.199866Z' - logging: [] - metadata: - produce: - - metadata: - dimension: - length: 183 - name: rows - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularRow - schema: https://metadata.datadrivendiscovery.org/schemas/v0/container.json - semantic_types: - - https://metadata.datadrivendiscovery.org/types/Table - structural_type: d3m.container.pandas.DataFrame - selector: [] - - metadata: - dimension: - length: 30 - name: columns - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularColumn - selector: - - __ALL_ELEMENTS__ - - metadata: - name: d3mIndex - semantic_types: - - http://schema.org/Integer - - https://metadata.datadrivendiscovery.org/types/PrimaryKey - structural_type: int - selector: - - __ALL_ELEMENTS__ - - 0 - - metadata: - name: idnum - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/UniqueKey - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 1 - - metadata: - name: state - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - 
__ALL_ELEMENTS__ - - 2 - - metadata: - name: state2 - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 3 - - metadata: - name: stfips - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 4 - - metadata: - name: zip - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 5 - - metadata: - name: region - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 6 - - metadata: - name: typebldg - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 7 - - metadata: - name: floor - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 8 - - metadata: - name: room - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 9 - - metadata: - name: basement - semantic_types: - - http://schema.org/Boolean - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: int - selector: - - __ALL_ELEMENTS__ - - 10 - - metadata: - name: windoor - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 11 - - metadata: - name: rep - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 12 - - metadata: - name: stratum - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 13 - - metadata: - name: wave - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 14 - - metadata: - name: starttm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 15 - - metadata: - name: stoptm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 16 - - metadata: - name: startdt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 17 - - metadata: - name: stopdt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 18 - - metadata: - name: 
activity - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 19 - - metadata: - name: pcterr - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 20 - - metadata: - name: adjwt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 21 - - metadata: - name: dupflag - semantic_types: - - http://schema.org/Boolean - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: int - selector: - - __ALL_ELEMENTS__ - - 22 - - metadata: - name: zipflag - semantic_types: - - http://schema.org/Boolean - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: int - selector: - - __ALL_ELEMENTS__ - - 23 - - metadata: - name: cntyfips - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 24 - - metadata: - name: county - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 25 - - metadata: - name: fips - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 26 - - metadata: - name: Uppm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 27 - - metadata: - name: county_code - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 28 - - metadata: - name: log_radon - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/SuggestedTarget - - https://metadata.datadrivendiscovery.org/types/Target - - https://metadata.datadrivendiscovery.org/types/TrueTarget - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 29 - name: multi_produce - start: '2019-06-19T09:54:49.117714Z' - status: - state: SUCCESS - start: '2019-06-19T09:54:49.117103Z' - status: - state: SUCCESS - type: PRIMITIVE -- end: '2019-06-19T09:54:49.209289Z' - method_calls: - - end: '2019-06-19T09:54:49.207372Z' - logging: [] - metadata: - produce: - - metadata: - dimension: - length: 183 - name: rows - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularRow - schema: https://metadata.datadrivendiscovery.org/schemas/v0/container.json - semantic_types: - - https://metadata.datadrivendiscovery.org/types/Table - structural_type: d3m.container.pandas.DataFrame - selector: [] - - metadata: - dimension: - length: 12 - name: columns - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularColumn - selector: - - __ALL_ELEMENTS__ - - metadata: - name: state - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 0 - - metadata: - name: state2 - semantic_types: - - 
https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 1 - - metadata: - name: zip - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 2 - - metadata: - name: region - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 3 - - metadata: - name: typebldg - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 4 - - metadata: - name: floor - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 5 - - metadata: - name: windoor - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 6 - - metadata: - name: rep - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 7 - - metadata: - name: stratum - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 8 - - metadata: - name: county - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 9 - - metadata: - name: fips - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 10 - - metadata: - name: county_code - semantic_types: - - https://metadata.datadrivendiscovery.org/types/CategoricalData - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 11 - name: multi_produce - start: '2019-06-19T09:54:49.204709Z' - status: - state: SUCCESS - start: '2019-06-19T09:54:49.204040Z' - status: - state: SUCCESS - type: PRIMITIVE -- end: '2019-06-19T09:54:49.214141Z' - method_calls: - - end: '2019-06-19T09:54:49.212034Z' - logging: [] - metadata: - produce: - - metadata: - dimension: - length: 183 - name: rows - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularRow - schema: https://metadata.datadrivendiscovery.org/schemas/v0/container.json - semantic_types: - - https://metadata.datadrivendiscovery.org/types/Table - structural_type: d3m.container.pandas.DataFrame - selector: [] - - metadata: - dimension: - length: 14 - name: columns - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularColumn - selector: - - __ALL_ELEMENTS__ - - metadata: - name: idnum - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/UniqueKey - structural_type: float - selector: 
- - __ALL_ELEMENTS__ - - 0 - - metadata: - name: stfips - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 1 - - metadata: - name: room - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 2 - - metadata: - name: wave - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 3 - - metadata: - name: starttm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 4 - - metadata: - name: stoptm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 5 - - metadata: - name: startdt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 6 - - metadata: - name: stopdt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 7 - - metadata: - name: activity - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 8 - - metadata: - name: pcterr - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 9 - - metadata: - name: adjwt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 10 - - metadata: - name: cntyfips - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 11 - - metadata: - name: Uppm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 12 - - metadata: - name: log_radon - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/SuggestedTarget - - https://metadata.datadrivendiscovery.org/types/Target - - https://metadata.datadrivendiscovery.org/types/TrueTarget - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 13 - name: multi_produce - start: '2019-06-19T09:54:49.209904Z' - status: - state: SUCCESS - start: '2019-06-19T09:54:49.209307Z' - status: - state: SUCCESS - type: PRIMITIVE -- end: '2019-06-19T09:54:49.217452Z' - method_calls: - - end: '2019-06-19T09:54:49.216856Z' - logging: [] - metadata: - produce: - - metadata: - dimension: - length: 183 - name: rows - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularRow - schema: https://metadata.datadrivendiscovery.org/schemas/v0/container.json - semantic_types: - - https://metadata.datadrivendiscovery.org/types/Table - structural_type: d3m.container.pandas.DataFrame - selector: [] - - metadata: - dimension: - length: 1 - name: columns - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularColumn - 
selector: - - __ALL_ELEMENTS__ - - metadata: - name: log_radon - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/SuggestedTarget - - https://metadata.datadrivendiscovery.org/types/Target - - https://metadata.datadrivendiscovery.org/types/TrueTarget - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 0 - name: multi_produce - start: '2019-06-19T09:54:49.214788Z' - status: - state: SUCCESS - start: '2019-06-19T09:54:49.214160Z' - status: - state: SUCCESS - type: PRIMITIVE -- end: '2019-06-19T09:54:49.254497Z' - method_calls: - - end: '2019-06-19T09:54:49.252367Z' - logging: [] - metadata: - produce: - - metadata: - dimension: - length: 183 - name: rows - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularRow - schema: https://metadata.datadrivendiscovery.org/schemas/v0/container.json - semantic_types: - - https://metadata.datadrivendiscovery.org/types/Table - structural_type: d3m.container.pandas.DataFrame - selector: [] - - metadata: - dimension: - length: 14 - name: columns - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularColumn - selector: - - __ALL_ELEMENTS__ - - metadata: - name: idnum - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/UniqueKey - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 0 - - metadata: - name: stfips - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 1 - - metadata: - name: room - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 2 - - metadata: - name: wave - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 3 - - metadata: - name: starttm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 4 - - metadata: - name: stoptm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 5 - - metadata: - name: startdt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 6 - - metadata: - name: stopdt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 7 - - metadata: - name: activity - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 8 - - metadata: - name: pcterr - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 9 - - metadata: - name: adjwt - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 10 - - metadata: - name: cntyfips - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - 
selector: - - __ALL_ELEMENTS__ - - 11 - - metadata: - name: Uppm - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/Attribute - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 12 - - metadata: - name: log_radon - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/SuggestedTarget - - https://metadata.datadrivendiscovery.org/types/Target - - https://metadata.datadrivendiscovery.org/types/TrueTarget - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 13 - name: multi_produce - start: '2019-06-19T09:54:49.218199Z' - status: - state: SUCCESS - start: '2019-06-19T09:54:49.217470Z' - status: - state: SUCCESS - type: PRIMITIVE -- end: '2019-06-19T09:54:49.265815Z' - method_calls: - - end: '2019-06-19T09:54:49.264935Z' - logging: [] - metadata: - produce: - - metadata: - dimension: - length: 183 - name: rows - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularRow - schema: https://metadata.datadrivendiscovery.org/schemas/v0/container.json - semantic_types: - - https://metadata.datadrivendiscovery.org/types/Table - structural_type: d3m.container.pandas.DataFrame - selector: [] - - metadata: - dimension: - length: 3 - name: columns - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularColumn - selector: - - __ALL_ELEMENTS__ - - metadata: - name: idnum - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/UniqueKey - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 0 - - metadata: - name: log_radon - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/SuggestedTarget - - https://metadata.datadrivendiscovery.org/types/Target - - https://metadata.datadrivendiscovery.org/types/PredictedTarget - structural_type: str - selector: - - __ALL_ELEMENTS__ - - 1 - - metadata: - name: log_radon - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/SuggestedTarget - - https://metadata.datadrivendiscovery.org/types/Target - - https://metadata.datadrivendiscovery.org/types/TrueTarget - structural_type: float - selector: - - __ALL_ELEMENTS__ - - 2 - name: multi_produce - start: '2019-06-19T09:54:49.255748Z' - status: - state: SUCCESS - start: '2019-06-19T09:54:49.254517Z' - status: - state: SUCCESS - type: PRIMITIVE -- end: '2019-06-19T09:54:49.273222Z' - method_calls: - - end: '2019-06-19T09:54:49.272461Z' - logging: [] - metadata: - produce: - - metadata: - dimension: - length: 183 - name: rows - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularRow - schema: https://metadata.datadrivendiscovery.org/schemas/v0/container.json - semantic_types: - - https://metadata.datadrivendiscovery.org/types/Table - structural_type: d3m.container.pandas.DataFrame - selector: [] - - metadata: - dimension: - length: 2 - name: columns - semantic_types: - - https://metadata.datadrivendiscovery.org/types/TabularColumn - selector: - - __ALL_ELEMENTS__ - - metadata: - name: d3mIndex - semantic_types: - - http://schema.org/Integer - - https://metadata.datadrivendiscovery.org/types/PrimaryKey - structural_type: int - selector: - - __ALL_ELEMENTS__ - - 0 - - metadata: - name: log_radon - semantic_types: - - http://schema.org/Float - - https://metadata.datadrivendiscovery.org/types/SuggestedTarget - - https://metadata.datadrivendiscovery.org/types/Target - - https://metadata.datadrivendiscovery.org/types/PredictedTarget - 
structural_type: str - selector: - - __ALL_ELEMENTS__ - - 1 - name: multi_produce - start: '2019-06-19T09:54:49.266417Z' - status: - state: SUCCESS - start: '2019-06-19T09:54:49.265835Z' - status: - state: SUCCESS - type: PRIMITIVE diff --git a/common-primitives/pipeline_runs/schema_discovery.profiler.Common/pipeline_run_extract_structural_types.yml.gz b/common-primitives/pipeline_runs/schema_discovery.profiler.Common/pipeline_run_extract_structural_types.yml.gz deleted file mode 120000 index 91f49b3..0000000 --- a/common-primitives/pipeline_runs/schema_discovery.profiler.Common/pipeline_run_extract_structural_types.yml.gz +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.extract_columns_by_structural_types.Common/pipeline_run.yml.gz \ No newline at end of file diff --git a/common-primitives/pipeline_runs/schema_discovery.profiler.Common/pipeline_run_group_field_compose.yml.gz b/common-primitives/pipeline_runs/schema_discovery.profiler.Common/pipeline_run_group_field_compose.yml.gz deleted file mode 120000 index 0a4dd35..0000000 --- a/common-primitives/pipeline_runs/schema_discovery.profiler.Common/pipeline_run_group_field_compose.yml.gz +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.grouping_field_compose.Common/pipeline_run.yml.gz \ No newline at end of file diff --git a/common-primitives/pipelines/classification.light_gbm.DataFrameCommon/d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde.json b/common-primitives/pipelines/classification.light_gbm.DataFrameCommon/d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde.json deleted file mode 100644 index 7ee4edb..0000000 --- a/common-primitives/pipelines/classification.light_gbm.DataFrameCommon/d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde.json +++ /dev/null @@ -1,246 +0,0 @@ -{ - "context": "TESTING", - "created": "2019-02-12T01:09:44.343543Z", - "id": "d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde", - "inputs": [ - { - "name": "inputs" - } - ], - "outputs": [ - { - "data": "steps.7.produce", - "name": "output predictions" - } - ], - "schema": "https://metadata.datadrivendiscovery.org/schemas/v0/pipeline.json", - "steps": [ - { - "arguments": { - "inputs": { - "data": "inputs.0", - "type": "CONTAINER" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "4b42ce1e-9b98-4a25-b68e-fad13311eb65", - "name": "Extract a DataFrame from a Dataset", - "python_path": "d3m.primitives.data_transformation.dataset_to_dataframe.Common", - "version": "0.3.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.0.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "parse_semantic_types": { - "data": [ - "http://schema.org/Boolean", - "http://schema.org/Integer", - "http://schema.org/Float", - "https://metadata.datadrivendiscovery.org/types/FloatVector", - "http://schema.org/DateTime" - ], - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "d510cb7a-1782-4f51-b44c-58f0236e47c7", - "name": "Parses strings into their types", - "python_path": "d3m.primitives.data_transformation.column_parser.Common", - "version": "0.6.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.1.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "semantic_types": { - "data": [ - "https://metadata.datadrivendiscovery.org/types/CategoricalData" - ], - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1", - "name": "Extracts columns by semantic type", - "python_path": 
"d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common", - "version": "0.3.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.1.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "exclude_columns": { - "data": [ - 0 - ], - "type": "VALUE" - }, - "semantic_types": { - "data": [ - "http://schema.org/Integer", - "http://schema.org/Float" - ], - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1", - "name": "Extracts columns by semantic type", - "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common", - "version": "0.3.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.0.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "semantic_types": { - "data": [ - "https://metadata.datadrivendiscovery.org/types/TrueTarget" - ], - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1", - "name": "Extracts columns by semantic type", - "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common", - "version": "0.3.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.3.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "return_result": { - "data": "replace", - "type": "VALUE" - }, - "use_semantic_types": { - "data": true, - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "d016df89-de62-3c53-87ed-c06bb6a23cde", - "name": "sklearn.impute.SimpleImputer", - "python_path": "d3m.primitives.data_cleaning.imputer.SKlearn", - "version": "2019.6.7" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.5.produce", - "type": "CONTAINER" - }, - "outputs": { - "data": "steps.4.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "return_result": { - "data": "replace", - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "259aa747-795c-435e-8e33-8c32a4c83c6b", - "name": "LightGBM GBTree classifier", - "python_path": "d3m.primitives.classification.light_gbm.Common", - "version": "0.1.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.6.produce", - "type": "CONTAINER" - }, - "reference": { - "data": "steps.1.produce", - "type": "CONTAINER" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "8d38b340-f83f-4877-baaa-162f8e551736", - "name": "Construct pipeline predictions output", - "python_path": "d3m.primitives.data_transformation.construct_predictions.Common", - "version": "0.3.0" - }, - "type": "PRIMITIVE" - } - ] -} diff --git a/common-primitives/pipelines/classification.random_forest.DataFrameCommon/b523335c-0c47-4d02-a582-f69609cde1e8.json b/common-primitives/pipelines/classification.random_forest.DataFrameCommon/b523335c-0c47-4d02-a582-f69609cde1e8.json deleted file mode 120000 index 51266fd..0000000 --- a/common-primitives/pipelines/classification.random_forest.DataFrameCommon/b523335c-0c47-4d02-a582-f69609cde1e8.json +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.extract_columns_by_structural_types.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json \ No newline at end of file diff --git a/common-primitives/pipelines/classification.random_forest.DataFrameCommon/ccad0f9c-130e-4063-a91e-ea65a18cb041.yaml 
b/common-primitives/pipelines/classification.random_forest.DataFrameCommon/ccad0f9c-130e-4063-a91e-ea65a18cb041.yaml deleted file mode 100644 index 470d2be..0000000 --- a/common-primitives/pipelines/classification.random_forest.DataFrameCommon/ccad0f9c-130e-4063-a91e-ea65a18cb041.yaml +++ /dev/null @@ -1,110 +0,0 @@ -id: ccad0f9c-130e-4063-a91e-ea65a18cb041 -schema: https://metadata.datadrivendiscovery.org/schemas/v0/pipeline.json -source: - name: Mitar -created: "2019-06-05T11:48:52.806069Z" -context: TESTING -name: Random Forest classifier pipeline -description: | - A simple pipeline which runs Random Forest classifier on tabular data. -inputs: - - name: input dataset -outputs: - - name: predictions - data: steps.5.produce -steps: - # Step 0. - - type: PRIMITIVE - primitive: - id: f31f8c1f-d1c5-43e5-a4b2-2ae4a761ef2e - version: 0.2.0 - python_path: d3m.primitives.data_transformation.denormalize.Common - name: Denormalize datasets - arguments: - inputs: - type: CONTAINER - data: inputs.0 - outputs: - - id: produce - # Step 1. - - type: PRIMITIVE - primitive: - id: 4b42ce1e-9b98-4a25-b68e-fad13311eb65 - version: 0.3.0 - python_path: d3m.primitives.data_transformation.dataset_to_dataframe.Common - name: Extract a DataFrame from a Dataset - arguments: - inputs: - type: CONTAINER - data: steps.0.produce - outputs: - - id: produce - # Step 2. - - type: PRIMITIVE - primitive: - id: d510cb7a-1782-4f51-b44c-58f0236e47c7 - version: 0.6.0 - python_path: d3m.primitives.data_transformation.column_parser.Common - name: Parses strings into their types - arguments: - inputs: - type: CONTAINER - data: steps.1.produce - outputs: - - id: produce - # Step 3. - - type: PRIMITIVE - primitive: - id: d016df89-de62-3c53-87ed-c06bb6a23cde - version: 2019.6.7 - python_path: d3m.primitives.data_cleaning.imputer.SKlearn - name: sklearn.impute.SimpleImputer - arguments: - inputs: - type: CONTAINER - data: steps.2.produce - outputs: - - id: produce - hyperparams: - use_semantic_types: - type: VALUE - data: true - return_result: - type: VALUE - data: replace - # Step 4. - - type: PRIMITIVE - primitive: - id: 37c2b19d-bdab-4a30-ba08-6be49edcc6af - version: 0.4.0 - python_path: d3m.primitives.classification.random_forest.Common - name: Random forest classifier - arguments: - inputs: - type: CONTAINER - data: steps.3.produce - outputs: - type: CONTAINER - data: steps.3.produce - outputs: - - id: produce - hyperparams: - return_result: - type: VALUE - data: replace - # Step 5. 
-  - type: PRIMITIVE
-    primitive:
-      id: 8d38b340-f83f-4877-baaa-162f8e551736
-      version: 0.3.0
-      python_path: d3m.primitives.data_transformation.construct_predictions.Common
-      name: Construct pipeline predictions output
-    arguments:
-      inputs:
-        type: CONTAINER
-        data: steps.4.produce
-      reference:
-        type: CONTAINER
-        data: steps.2.produce
-    outputs:
-      - id: produce
diff --git a/common-primitives/pipelines/classification.xgboost_dart.DataFrameCommon/b7a24816-2518-4073-9c45-b97f2b2fee30.json b/common-primitives/pipelines/classification.xgboost_dart.DataFrameCommon/b7a24816-2518-4073-9c45-b97f2b2fee30.json
deleted file mode 100644
index b5ba302..0000000
--- a/common-primitives/pipelines/classification.xgboost_dart.DataFrameCommon/b7a24816-2518-4073-9c45-b97f2b2fee30.json
+++ /dev/null
@@ -1,246 +0,0 @@
-{
-    "context": "TESTING",
-    "created": "2019-02-12T01:33:29.921236Z",
-    "id": "b7a24816-2518-4073-9c45-b97f2b2fee30",
-    "inputs": [
-        {
-            "name": "inputs"
-        }
-    ],
-    "outputs": [
-        {
-            "data": "steps.7.produce",
-            "name": "output predictions"
-        }
-    ],
-    "schema": "https://metadata.datadrivendiscovery.org/schemas/v0/pipeline.json",
-    "steps": [
-        {
-            "arguments": {
-                "inputs": {
-                    "data": "inputs.0",
-                    "type": "CONTAINER"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ],
-            "primitive": {
-                "id": "4b42ce1e-9b98-4a25-b68e-fad13311eb65",
-                "name": "Extract a DataFrame from a Dataset",
-                "python_path": "d3m.primitives.data_transformation.dataset_to_dataframe.Common",
-                "version": "0.3.0"
-            },
-            "type": "PRIMITIVE"
-        },
-        {
-            "arguments": {
-                "inputs": {
-                    "data": "steps.0.produce",
-                    "type": "CONTAINER"
-                }
-            },
-            "hyperparams": {
-                "parse_semantic_types": {
-                    "data": [
-                        "http://schema.org/Boolean",
-                        "http://schema.org/Integer",
-                        "http://schema.org/Float",
-                        "https://metadata.datadrivendiscovery.org/types/FloatVector",
-                        "http://schema.org/DateTime"
-                    ],
-                    "type": "VALUE"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ],
-            "primitive": {
-                "id": "d510cb7a-1782-4f51-b44c-58f0236e47c7",
-                "name": "Parses strings into their types",
-                "python_path": "d3m.primitives.data_transformation.column_parser.Common",
-                "version": "0.6.0"
-            },
-            "type": "PRIMITIVE"
-        },
-        {
-            "arguments": {
-                "inputs": {
-                    "data": "steps.1.produce",
-                    "type": "CONTAINER"
-                }
-            },
-            "hyperparams": {
-                "semantic_types": {
-                    "data": [
-                        "https://metadata.datadrivendiscovery.org/types/CategoricalData"
-                    ],
-                    "type": "VALUE"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ],
-            "primitive": {
-                "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1",
-                "name": "Extracts columns by semantic type",
-                "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common",
-                "version": "0.3.0"
-            },
-            "type": "PRIMITIVE"
-        },
-        {
-            "arguments": {
-                "inputs": {
-                    "data": "steps.1.produce",
-                    "type": "CONTAINER"
-                }
-            },
-            "hyperparams": {
-                "exclude_columns": {
-                    "data": [
-                        0
-                    ],
-                    "type": "VALUE"
-                },
-                "semantic_types": {
-                    "data": [
-                        "http://schema.org/Integer",
-                        "http://schema.org/Float"
-                    ],
-                    "type": "VALUE"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ],
-            "primitive": {
-                "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1",
-                "name": "Extracts columns by semantic type",
-                "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common",
-                "version": "0.3.0"
-            },
-            "type": "PRIMITIVE"
-        },
-        {
-            "arguments": {
-                "inputs": {
-                    "data": "steps.0.produce",
-                    "type": "CONTAINER"
-                }
-            },
-            "hyperparams": {
-                "semantic_types": {
-                    "data": [
-                        "https://metadata.datadrivendiscovery.org/types/TrueTarget"
-                    ],
-                    "type": "VALUE"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ],
-            "primitive": {
-                "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1",
-                "name": "Extracts columns by semantic type",
-                "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common",
-                "version": "0.3.0"
-            },
-            "type": "PRIMITIVE"
-        },
-        {
-            "arguments": {
-                "inputs": {
-                    "data": "steps.3.produce",
-                    "type": "CONTAINER"
-                }
-            },
-            "hyperparams": {
-                "return_result": {
-                    "data": "replace",
-                    "type": "VALUE"
-                },
-                "use_semantic_types": {
-                    "data": true,
-                    "type": "VALUE"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ],
-            "primitive": {
-                "id": "d016df89-de62-3c53-87ed-c06bb6a23cde",
-                "name": "sklearn.impute.SimpleImputer",
-                "python_path": "d3m.primitives.data_cleaning.imputer.SKlearn",
-                "version": "2019.6.7"
-            },
-            "type": "PRIMITIVE"
-        },
-        {
-            "arguments": {
-                "inputs": {
-                    "data": "steps.5.produce",
-                    "type": "CONTAINER"
-                },
-                "outputs": {
-                    "data": "steps.4.produce",
-                    "type": "CONTAINER"
-                }
-            },
-            "hyperparams": {
-                "return_result": {
-                    "data": "replace",
-                    "type": "VALUE"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ],
-            "primitive": {
-                "id": "7476950e-4373-4cf5-a852-7e16afb8e098",
-                "name": "XGBoost DART classifier",
-                "python_path": "d3m.primitives.classification.xgboost_dart.Common",
-                "version": "0.1.0"
-            },
-            "type": "PRIMITIVE"
-        },
-        {
-            "arguments": {
-                "inputs": {
-                    "data": "steps.6.produce",
-                    "type": "CONTAINER"
-                },
-                "reference": {
-                    "data": "steps.1.produce",
-                    "type": "CONTAINER"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ],
-            "primitive": {
-                "id": "8d38b340-f83f-4877-baaa-162f8e551736",
-                "name": "Construct pipeline predictions output",
-                "python_path": "d3m.primitives.data_transformation.construct_predictions.Common",
-                "version": "0.3.0"
-            },
-            "type": "PRIMITIVE"
-        }
-    ]
-}
diff --git a/common-primitives/pipelines/classification.xgboost_gbtree.DataFrameCommon/4d402450-2562-48cc-93fd-719fb658c43c.json b/common-primitives/pipelines/classification.xgboost_gbtree.DataFrameCommon/4d402450-2562-48cc-93fd-719fb658c43c.json
deleted file mode 100644
index 629964e..0000000
--- a/common-primitives/pipelines/classification.xgboost_gbtree.DataFrameCommon/4d402450-2562-48cc-93fd-719fb658c43c.json
+++ /dev/null
@@ -1,246 +0,0 @@
-{
-    "context": "TESTING",
-    "created": "2019-02-12T01:18:47.753202Z",
-    "id": "4d402450-2562-48cc-93fd-719fb658c43c",
-    "inputs": [
-        {
-            "name": "inputs"
-        }
-    ],
-    "outputs": [
-        {
-            "data": "steps.7.produce",
-            "name": "output predictions"
-        }
-    ],
-    "schema": "https://metadata.datadrivendiscovery.org/schemas/v0/pipeline.json",
-    "steps": [
-        {
-            "arguments": {
-                "inputs": {
-                    "data": "inputs.0",
-                    "type": "CONTAINER"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ],
-            "primitive": {
-                "id": "4b42ce1e-9b98-4a25-b68e-fad13311eb65",
-                "name": "Extract a DataFrame from a Dataset",
-                "python_path": "d3m.primitives.data_transformation.dataset_to_dataframe.Common",
-                "version": "0.3.0"
-            },
-            "type": "PRIMITIVE"
-        },
-        {
-            "arguments": {
-                "inputs": {
-                    "data": "steps.0.produce",
-                    "type": "CONTAINER"
-                }
-            },
-            "hyperparams": {
-                "parse_semantic_types": {
-                    "data": [
-                        "http://schema.org/Boolean",
-                        "http://schema.org/Integer",
-                        "http://schema.org/Float",
-                        "https://metadata.datadrivendiscovery.org/types/FloatVector",
-                        "http://schema.org/DateTime"
-                    ],
-                    "type": "VALUE"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ],
-            "primitive": {
-                "id": "d510cb7a-1782-4f51-b44c-58f0236e47c7",
-                "name": "Parses strings into their types",
-                "python_path": "d3m.primitives.data_transformation.column_parser.Common",
-                "version": "0.6.0"
-            },
-            "type": "PRIMITIVE"
-        },
-        {
-            "arguments": {
-                "inputs": {
-                    "data": "steps.1.produce",
-                    "type": "CONTAINER"
-                }
-            },
-            "hyperparams": {
-                "semantic_types": {
-                    "data": [
-                        "https://metadata.datadrivendiscovery.org/types/CategoricalData"
-                    ],
-                    "type": "VALUE"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ],
-            "primitive": {
-                "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1",
-                "name": "Extracts columns by semantic type",
-                "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common",
-                "version": "0.3.0"
-            },
-            "type": "PRIMITIVE"
-        },
-        {
-            "arguments": {
-                "inputs": {
-                    "data": "steps.1.produce",
-                    "type": "CONTAINER"
-                }
-            },
-            "hyperparams": {
-                "exclude_columns": {
-                    "data": [
-                        0
-                    ],
-                    "type": "VALUE"
-                },
-                "semantic_types": {
-                    "data": [
-                        "http://schema.org/Integer",
-                        "http://schema.org/Float"
-                    ],
-                    "type": "VALUE"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ],
-            "primitive": {
-                "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1",
-                "name": "Extracts columns by semantic type",
-                "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common",
-                "version": "0.3.0"
-            },
-            "type": "PRIMITIVE"
-        },
-        {
-            "arguments": {
-                "inputs": {
-                    "data": "steps.0.produce",
-                    "type": "CONTAINER"
-                }
-            },
-            "hyperparams": {
-                "semantic_types": {
-                    "data": [
-                        "https://metadata.datadrivendiscovery.org/types/TrueTarget"
-                    ],
-                    "type": "VALUE"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ],
-            "primitive": {
-                "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1",
-                "name": "Extracts columns by semantic type",
-                "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common",
-                "version": "0.3.0"
-            },
-            "type": "PRIMITIVE"
-        },
-        {
-            "arguments": {
-                "inputs": {
-                    "data": "steps.3.produce",
-                    "type": "CONTAINER"
-                }
-            },
-            "hyperparams": {
-                "return_result": {
-                    "data": "replace",
-                    "type": "VALUE"
-                },
-                "use_semantic_types": {
-                    "data": true,
-                    "type": "VALUE"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ],
-            "primitive": {
-                "id": "d016df89-de62-3c53-87ed-c06bb6a23cde",
-                "name": "sklearn.impute.SimpleImputer",
-                "python_path": "d3m.primitives.data_cleaning.imputer.SKlearn",
-                "version": "2019.6.7"
-            },
-            "type": "PRIMITIVE"
-        },
-        {
-            "arguments": {
-                "inputs": {
-                    "data": "steps.5.produce",
-                    "type": "CONTAINER"
-                },
-                "outputs": {
-                    "data": "steps.4.produce",
-                    "type": "CONTAINER"
-                }
-            },
-            "hyperparams": {
-                "return_result": {
-                    "data": "replace",
-                    "type": "VALUE"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ],
-            "primitive": {
-                "id": "fe0841b7-6e70-4bc3-a56c-0670a95ebc6a",
-                "name": "XGBoost GBTree classifier",
-                "python_path": "d3m.primitives.classification.xgboost_gbtree.Common",
-                "version": "0.1.0"
-            },
-            "type": "PRIMITIVE"
-        },
-        {
-            "arguments": {
-                "inputs": {
-                    "data": "steps.6.produce",
-                    "type": "CONTAINER"
-                },
-                "reference": {
-                    "data": "steps.1.produce",
-                    "type": "CONTAINER"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ],
-            "primitive": {
-                "id": "8d38b340-f83f-4877-baaa-162f8e551736",
-                "name": "Construct pipeline predictions output",
-                "python_path": "d3m.primitives.data_transformation.construct_predictions.Common",
-                "version": "0.3.0"
-            },
-            "type": "PRIMITIVE"
-        }
-    ]
-}
diff --git a/common-primitives/pipelines/data_augmentation.datamart_augmentation.Common/3afd2bd2-7ba1-4ac1-928f-fad0c39a05e5.json b/common-primitives/pipelines/data_augmentation.datamart_augmentation.Common/3afd2bd2-7ba1-4ac1-928f-fad0c39a05e5.json
deleted file mode 100644
index b873182..0000000
--- a/common-primitives/pipelines/data_augmentation.datamart_augmentation.Common/3afd2bd2-7ba1-4ac1-928f-fad0c39a05e5.json
+++ /dev/null
@@ -1,522 +0,0 @@
-{
-    "id": "3afd2bd2-7ba1-4ac1-928f-fad0c39a05e5",
-    "schema": "https://metadata.datadrivendiscovery.org/schemas/v0/pipeline.json",
-    "created": "2019-11-06T04:22:27.325146Z",
-    "inputs": [
-        {
-            "name": "input dataset"
-        }
-    ],
-    "outputs": [
-        {
-            "data": "steps.17.produce",
-            "name": "predictions of input dataset"
-        }
-    ],
-    "steps": [
-        {
-            "type": "PRIMITIVE",
-            "primitive": {
-                "id": "f31f8c1f-d1c5-43e5-a4b2-2ae4a761ef2e",
-                "version": "0.2.0",
-                "python_path": "d3m.primitives.data_transformation.denormalize.Common",
-                "name": "Denormalize datasets",
-                "digest": "80ddde3709877015f7e5d262621fb4c25a2db0c7ba03c62c4fdf80cd3ede5d5b"
-            },
-            "arguments": {
-                "inputs": {
-                    "type": "CONTAINER",
-                    "data": "inputs.0"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ],
-            "hyperparams": {
-                "starting_resource": {
-                    "type": "VALUE",
-                    "data": null
-                },
-                "recursive": {
-                    "type": "VALUE",
-                    "data": true
-                },
-                "many_to_many": {
-                    "type": "VALUE",
-                    "data": false
-                },
-                "discard_not_joined_tabular_resources": {
-                    "type": "VALUE",
-                    "data": false
-                }
-            }
-        },
-        {
-            "type": "PRIMITIVE",
-            "primitive": {
-                "id": "fe0f1ac8-1d39-463a-b344-7bd498a31b91",
-                "version": "0.1",
-                "python_path": "d3m.primitives.data_augmentation.datamart_augmentation.Common",
-                "name": "Perform dataset augmentation using Datamart",
-                "digest": "5f3eda98f6a45530343707fd3e2159879d1ad4550f589a5596389c41fab83d47"
-            },
-            "arguments": {
-                "inputs": {
-                    "type": "CONTAINER",
-                    "data": "steps.0.produce"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ],
-            "hyperparams": {
-                "system_identifier": {
-                    "type": "VALUE",
-                    "data": "ISI"
-                },
-                "search_result": {
-                    "type": "VALUE",
-                    "data": "{\"augmentation\": {\"left_columns\": [[19]], \"right_columns\": [[12]], \"type\": \"join\"}, \"id\": \"wikidata_search_on___P1082___P1449___P1451___P1549___P1705___P1813___P2044___P2046___P2927___P571___P6591___with_column_STABBR_wikidata\", \"materialize_info\": \"{\\\"id\\\": \\\"wikidata_search_on___P1082___P1449___P1451___P1549___P1705___P1813___P2044___P2046___P2927___P571___P6591___with_column_STABBR_wikidata\\\", \\\"score\\\": 1, \\\"metadata\\\": {\\\"connection_url\\\": \\\"http://dsbox02.isi.edu:9000\\\", \\\"search_result\\\": {\\\"p_nodes_needed\\\": [\\\"P1082\\\", \\\"P1449\\\", \\\"P1451\\\", \\\"P1549\\\", \\\"P1705\\\", \\\"P1813\\\", \\\"P2044\\\", \\\"P2046\\\", \\\"P2927\\\", \\\"P571\\\", \\\"P6591\\\"], \\\"target_q_node_column_name\\\": \\\"STABBR_wikidata\\\"}, \\\"query_json\\\": null, \\\"search_type\\\": \\\"wikidata\\\"}, \\\"augmentation\\\": {\\\"properties\\\": \\\"join\\\", \\\"left_columns\\\": [19], \\\"right_columns\\\": [12]}, \\\"datamart_type\\\": \\\"isi\\\"}\", \"metadata\": [{\"metadata\": {\"dimension\": {\"length\": 3243, \"name\": \"rows\", \"semantic_types\": [\"https://metadata.datadrivendiscovery.org/types/TabularRow\"]}, \"schema\": \"https://metadata.datadrivendiscovery.org/schemas/v0/container.json\", \"semantic_types\": [\"https://metadata.datadrivendiscovery.org/types/Table\"], \"structural_type\": \"d3m.container.pandas.DataFrame\"}, \"selector\": []}, 
{\"metadata\": {\"dimension\": {\"length\": 11, \"name\": \"columns\", \"semantic_types\": [\"https://metadata.datadrivendiscovery.org/types/TabularColumn\"]}}, \"selector\": [\"__ALL_ELEMENTS__\"]}, {\"metadata\": {\"P_node\": \"P1082\", \"name\": \"population_for_STABBR_wikidata\", \"semantic_types\": [true, [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\", \"https://metadata.datadrivendiscovery.org/types/Datamart_augmented_column\"]], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 0]}, {\"metadata\": {\"P_node\": \"P1449\", \"name\": \"nickname_for_STABBR_wikidata\", \"semantic_types\": [true, [\"http://schema.org/Text\", \"https://metadata.datadrivendiscovery.org/types/Attribute\", \"https://metadata.datadrivendiscovery.org/types/Datamart_augmented_column\"]], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 1]}, {\"metadata\": {\"P_node\": \"P1451\", \"name\": \"motto text_for_STABBR_wikidata\", \"semantic_types\": [true, [\"http://schema.org/Text\", \"https://metadata.datadrivendiscovery.org/types/Attribute\", \"https://metadata.datadrivendiscovery.org/types/Datamart_augmented_column\"]], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 2]}, {\"metadata\": {\"P_node\": \"P1549\", \"name\": \"demonym_for_STABBR_wikidata\", \"semantic_types\": [true, [\"http://schema.org/Text\", \"https://metadata.datadrivendiscovery.org/types/Attribute\", \"https://metadata.datadrivendiscovery.org/types/Datamart_augmented_column\"]], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 3]}, {\"metadata\": {\"P_node\": \"P1705\", \"name\": \"native label_for_STABBR_wikidata\", \"semantic_types\": [true, [\"http://schema.org/Text\", \"https://metadata.datadrivendiscovery.org/types/Attribute\", \"https://metadata.datadrivendiscovery.org/types/Datamart_augmented_column\"]], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 4]}, {\"metadata\": {\"P_node\": \"P1813\", \"name\": \"short name_for_STABBR_wikidata\", \"semantic_types\": [true, [\"http://schema.org/Text\", \"https://metadata.datadrivendiscovery.org/types/Attribute\", \"https://metadata.datadrivendiscovery.org/types/Datamart_augmented_column\"]], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 5]}, {\"metadata\": {\"P_node\": \"P2044\", \"name\": \"elevation above sea level_for_STABBR_wikidata\", \"semantic_types\": [true, [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\", \"https://metadata.datadrivendiscovery.org/types/Datamart_augmented_column\"]], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 6]}, {\"metadata\": {\"P_node\": \"P2046\", \"name\": \"area_for_STABBR_wikidata\", \"semantic_types\": [true, [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\", \"https://metadata.datadrivendiscovery.org/types/Datamart_augmented_column\"]], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 7]}, {\"metadata\": {\"P_node\": \"P2927\", \"name\": \"water as percent of area_for_STABBR_wikidata\", \"semantic_types\": [true, [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\", \"https://metadata.datadrivendiscovery.org/types/Datamart_augmented_column\"]], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 8]}, {\"metadata\": {\"P_node\": \"P571\", \"name\": \"inception_for_STABBR_wikidata\", \"semantic_types\": [true, 
[\"http://schema.org/DateTime\", \"https://metadata.datadrivendiscovery.org/types/Attribute\", \"https://metadata.datadrivendiscovery.org/types/Datamart_augmented_column\"]], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 9]}, {\"metadata\": {\"P_node\": \"P6591\", \"name\": \"maximum temperature record_for_STABBR_wikidata\", \"semantic_types\": [true, [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\", \"https://metadata.datadrivendiscovery.org/types/Datamart_augmented_column\"]], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 10]}, {\"metadata\": {\"name\": \"q_node\", \"semantic_types\": [\"https://metadata.datadrivendiscovery.org/types/CategoricalData\", \"https://metadata.datadrivendiscovery.org/types/Attribute\", \"http://wikidata.org/qnode\", \"https://metadata.datadrivendiscovery.org/types/Datamart_augmented_column\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 11]}, {\"metadata\": {\"name\": \"joining_pairs\", \"semantic_types\": [\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\", \"https://metadata.datadrivendiscovery.org/types/Datamart_augmented_column\"], \"structural_type\": \"list\"}, \"selector\": [\"__ALL_ELEMENTS__\", 12]}], \"score\": 1, \"summary\": {\"Columns\": [\"[0] population\", \"[1] nickname\", \"[2] motto text\", \"[3] demonym\", \"[4] native label\", \"[5] short name\", \"[6] elevation above sea level\", \"[7] area\", \"[8] water as percent of area\", \"[9] inception\", \"[10] maximum temperature record\"], \"Datamart ID\": \"wikidata_search_on___P1082___P1449___P1451___P1549___P1705___P1813___P2044___P2046___P2927___P571___P6591___with_column_STABBR_wikidata\", \"Recommend Join Columns\": \"STABBR_wikidata\", \"Score\": \"1\", \"URL\": \"None\", \"title\": \"wikidata search result for STABBR_wikidata\"}, \"supplied_id\": \"DA_college_debt_dataset_TRAIN\", \"supplied_resource_id\": \"learningData\"}"
-                }
-            }
-        },
-        {
-            "type": "PRIMITIVE",
-            "primitive": {
-                "id": "dsbox-featurizer-do-nothing-dataset-version",
-                "version": "1.5.3",
-                "python_path": "d3m.primitives.data_preprocessing.do_nothing_for_dataset.DSBOX",
-                "name": "DSBox do-nothing primitive dataset version",
-                "digest": "c42dca1f4110288d5399d05b6dcd776a63e110d8d266521622581b500b08cee2"
-            },
-            "arguments": {
-                "inputs": {
-                    "type": "CONTAINER",
-                    "data": "steps.1.produce"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ]
-        },
-        {
-            "type": "PRIMITIVE",
-            "primitive": {
-                "id": "fe0f1ac8-1d39-463a-b344-7bd498a31b91",
-                "version": "0.1",
-                "python_path": "d3m.primitives.data_augmentation.datamart_augmentation.Common",
-                "name": "Perform dataset augmentation using Datamart",
-                "digest": "5f3eda98f6a45530343707fd3e2159879d1ad4550f589a5596389c41fab83d47"
-            },
-            "arguments": {
-                "inputs": {
-                    "type": "CONTAINER",
-                    "data": "steps.2.produce"
-                }
-            },
-            "outputs": [
-                {
-                    "id": "produce"
-                }
-            ],
-            "hyperparams": {
-                "system_identifier": {
-                    "type": "VALUE",
-                    "data": "ISI"
-                },
-                "search_result": {
-                    "type": "VALUE",
-                    "data": "{\"augmentation\": {\"left_columns\": [[2]], \"right_columns\": [[3]], \"type\": \"join\"}, \"id\": \"D4cb70062-77ed-4097-a486-0b43ffe81463\", \"materialize_info\": \"{\\\"id\\\": \\\"D4cb70062-77ed-4097-a486-0b43ffe81463\\\", \\\"score\\\": 0.9398390424662831, \\\"metadata\\\": {\\\"connection_url\\\": \\\"http://dsbox02.isi.edu:9000\\\", \\\"search_result\\\": {\\\"variable\\\": {\\\"type\\\": \\\"uri\\\", \\\"value\\\": 
\\\"http://www.wikidata.org/entity/statement/D4cb70062-77ed-4097-a486-0b43ffe81463-db0080de-12d9-4189-b13a-2a46fa63a227\\\"}, \\\"dataset\\\": {\\\"type\\\": \\\"uri\\\", \\\"value\\\": \\\"http://www.wikidata.org/entity/D4cb70062-77ed-4097-a486-0b43ffe81463\\\"}, \\\"url\\\": {\\\"type\\\": \\\"uri\\\", \\\"value\\\": \\\"http://dsbox02.isi.edu:9000/upload/local_datasets/Most-Recent-Cohorts-Scorecard-Elements.csv\\\"}, \\\"file_type\\\": {\\\"datatype\\\": \\\"http://www.w3.org/2001/XMLSchema#string\\\", \\\"type\\\": \\\"literal\\\", \\\"value\\\": \\\"csv\\\"}, \\\"extra_information\\\": {\\\"datatype\\\": \\\"http://www.w3.org/2001/XMLSchema#string\\\", \\\"type\\\": \\\"literal\\\", \\\"value\\\": \\\"{\\\\\\\"column_meta_0\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Integer\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"UNITID\\\\\\\"}, \\\\\\\"column_meta_1\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Integer\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"OPEID\\\\\\\"}, \\\\\\\"column_meta_2\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Integer\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"OPEID6\\\\\\\"}, \\\\\\\"column_meta_3\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Text\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"INSTNM\\\\\\\"}, \\\\\\\"column_meta_4\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Text\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"CITY\\\\\\\"}, \\\\\\\"column_meta_5\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Text\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"STABBR\\\\\\\"}, \\\\\\\"column_meta_6\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Text\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"INSTURL\\\\\\\"}, \\\\\\\"column_meta_7\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Text\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"NPCURL\\\\\\\"}, \\\\\\\"column_meta_8\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Integer\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"HCM2\\\\\\\"}, \\\\\\\"column_meta_9\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Integer\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PREDDEG\\\\\\\"}, \\\\\\\"column_meta_10\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Integer\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"HIGHDEG\\\\\\\"}, \\\\\\\"column_meta_11\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Integer\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"CONTROL\\\\\\\"}, \\\\\\\"column_meta_12\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": 
[\\\\\\\"http://schema.org/Integer\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"LOCALE\\\\\\\"}, \\\\\\\"column_meta_13\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Integer\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"HBCU\\\\\\\"}, \\\\\\\"column_meta_14\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Integer\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PBI\\\\\\\"}, \\\\\\\"column_meta_15\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Integer\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"ANNHI\\\\\\\"}, \\\\\\\"column_meta_16\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Integer\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"TRIBAL\\\\\\\"}, \\\\\\\"column_meta_17\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Integer\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"AANAPII\\\\\\\"}, \\\\\\\"column_meta_18\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Integer\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"HSI\\\\\\\"}, \\\\\\\"column_meta_19\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Integer\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"NANTI\\\\\\\"}, \\\\\\\"column_meta_20\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Integer\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"MENONLY\\\\\\\"}, \\\\\\\"column_meta_21\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Integer\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"WOMENONLY\\\\\\\"}, \\\\\\\"column_meta_22\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"RELAFFIL\\\\\\\"}, \\\\\\\"column_meta_23\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"SATVR25\\\\\\\"}, \\\\\\\"column_meta_24\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"SATVR75\\\\\\\"}, \\\\\\\"column_meta_25\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"SATMT25\\\\\\\"}, \\\\\\\"column_meta_26\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"SATMT75\\\\\\\"}, \\\\\\\"column_meta_27\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", 
\\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"SATWR25\\\\\\\"}, \\\\\\\"column_meta_28\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"SATWR75\\\\\\\"}, \\\\\\\"column_meta_29\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"SATVRMID\\\\\\\"}, \\\\\\\"column_meta_30\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"SATMTMID\\\\\\\"}, \\\\\\\"column_meta_31\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"SATWRMID\\\\\\\"}, \\\\\\\"column_meta_32\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"ACTCM25\\\\\\\"}, \\\\\\\"column_meta_33\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"ACTCM75\\\\\\\"}, \\\\\\\"column_meta_34\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"ACTEN25\\\\\\\"}, \\\\\\\"column_meta_35\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"ACTEN75\\\\\\\"}, \\\\\\\"column_meta_36\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"ACTMT25\\\\\\\"}, \\\\\\\"column_meta_37\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"ACTMT75\\\\\\\"}, \\\\\\\"column_meta_38\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"https://metadata.datadrivendiscovery.org/types/CategoricalData\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"ACTWR25\\\\\\\"}, \\\\\\\"column_meta_39\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"https://metadata.datadrivendiscovery.org/types/CategoricalData\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"ACTWR75\\\\\\\"}, \\\\\\\"column_meta_40\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"ACTCMMID\\\\\\\"}, \\\\\\\"column_meta_41\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"ACTENMID\\\\\\\"}, \\\\\\\"column_meta_42\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", 
\\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"ACTMTMID\\\\\\\"}, \\\\\\\"column_meta_43\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"https://metadata.datadrivendiscovery.org/types/CategoricalData\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"ACTWRMID\\\\\\\"}, \\\\\\\"column_meta_44\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"SAT_AVG\\\\\\\"}, \\\\\\\"column_meta_45\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"SAT_AVG_ALL\\\\\\\"}, \\\\\\\"column_meta_46\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP01\\\\\\\"}, \\\\\\\"column_meta_47\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP03\\\\\\\"}, \\\\\\\"column_meta_48\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP04\\\\\\\"}, \\\\\\\"column_meta_49\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP05\\\\\\\"}, \\\\\\\"column_meta_50\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP09\\\\\\\"}, \\\\\\\"column_meta_51\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP10\\\\\\\"}, \\\\\\\"column_meta_52\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP11\\\\\\\"}, \\\\\\\"column_meta_53\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP12\\\\\\\"}, \\\\\\\"column_meta_54\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP13\\\\\\\"}, \\\\\\\"column_meta_55\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP14\\\\\\\"}, \\\\\\\"column_meta_56\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP15\\\\\\\"}, \\\\\\\"column_meta_57\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], 
\\\\\\\"name\\\\\\\": \\\\\\\"PCIP16\\\\\\\"}, \\\\\\\"column_meta_58\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP19\\\\\\\"}, \\\\\\\"column_meta_59\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP22\\\\\\\"}, \\\\\\\"column_meta_60\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP23\\\\\\\"}, \\\\\\\"column_meta_61\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP24\\\\\\\"}, \\\\\\\"column_meta_62\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP25\\\\\\\"}, \\\\\\\"column_meta_63\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP26\\\\\\\"}, \\\\\\\"column_meta_64\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP27\\\\\\\"}, \\\\\\\"column_meta_65\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP29\\\\\\\"}, \\\\\\\"column_meta_66\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP30\\\\\\\"}, \\\\\\\"column_meta_67\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP31\\\\\\\"}, \\\\\\\"column_meta_68\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP38\\\\\\\"}, \\\\\\\"column_meta_69\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP39\\\\\\\"}, \\\\\\\"column_meta_70\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP40\\\\\\\"}, \\\\\\\"column_meta_71\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP41\\\\\\\"}, \\\\\\\"column_meta_72\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP42\\\\\\\"}, \\\\\\\"column_meta_73\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": 
[\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP43\\\\\\\"}, \\\\\\\"column_meta_74\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP44\\\\\\\"}, \\\\\\\"column_meta_75\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP45\\\\\\\"}, \\\\\\\"column_meta_76\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP46\\\\\\\"}, \\\\\\\"column_meta_77\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP47\\\\\\\"}, \\\\\\\"column_meta_78\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP48\\\\\\\"}, \\\\\\\"column_meta_79\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP49\\\\\\\"}, \\\\\\\"column_meta_80\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP50\\\\\\\"}, \\\\\\\"column_meta_81\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP51\\\\\\\"}, \\\\\\\"column_meta_82\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP52\\\\\\\"}, \\\\\\\"column_meta_83\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCIP54\\\\\\\"}, \\\\\\\"column_meta_84\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Integer\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"DISTANCEONLY\\\\\\\"}, \\\\\\\"column_meta_85\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"UGDS\\\\\\\"}, \\\\\\\"column_meta_86\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"UGDS_WHITE\\\\\\\"}, \\\\\\\"column_meta_87\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"UGDS_BLACK\\\\\\\"}, \\\\\\\"column_meta_88\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], 
\\\\\\\"name\\\\\\\": \\\\\\\"UGDS_HISP\\\\\\\"}, \\\\\\\"column_meta_89\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"UGDS_ASIAN\\\\\\\"}, \\\\\\\"column_meta_90\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"UGDS_AIAN\\\\\\\"}, \\\\\\\"column_meta_91\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"UGDS_NHPI\\\\\\\"}, \\\\\\\"column_meta_92\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"UGDS_2MOR\\\\\\\"}, \\\\\\\"column_meta_93\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"UGDS_NRA\\\\\\\"}, \\\\\\\"column_meta_94\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"UGDS_UNKN\\\\\\\"}, \\\\\\\"column_meta_95\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PPTUG_EF\\\\\\\"}, \\\\\\\"column_meta_96\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Integer\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"CURROPER\\\\\\\"}, \\\\\\\"column_meta_97\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"NPT4_PUB\\\\\\\"}, \\\\\\\"column_meta_98\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"NPT4_PRIV\\\\\\\"}, \\\\\\\"column_meta_99\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"NPT41_PUB\\\\\\\"}, \\\\\\\"column_meta_100\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"NPT42_PUB\\\\\\\"}, \\\\\\\"column_meta_101\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"NPT43_PUB\\\\\\\"}, \\\\\\\"column_meta_102\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"NPT44_PUB\\\\\\\"}, \\\\\\\"column_meta_103\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"NPT45_PUB\\\\\\\"}, \\\\\\\"column_meta_104\\\\\\\": 
{\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"NPT41_PRIV\\\\\\\"}, \\\\\\\"column_meta_105\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"NPT42_PRIV\\\\\\\"}, \\\\\\\"column_meta_106\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"NPT43_PRIV\\\\\\\"}, \\\\\\\"column_meta_107\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"NPT44_PRIV\\\\\\\"}, \\\\\\\"column_meta_108\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"NPT45_PRIV\\\\\\\"}, \\\\\\\"column_meta_109\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCTPELL\\\\\\\"}, \\\\\\\"column_meta_110\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"RET_FT4_POOLED_SUPP\\\\\\\"}, \\\\\\\"column_meta_111\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"RET_FTL4_POOLED_SUPP\\\\\\\"}, \\\\\\\"column_meta_112\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"RET_PT4_POOLED_SUPP\\\\\\\"}, \\\\\\\"column_meta_113\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"RET_PTL4_POOLED_SUPP\\\\\\\"}, \\\\\\\"column_meta_114\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"PCTFLOAN\\\\\\\"}, \\\\\\\"column_meta_115\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"UG25ABV\\\\\\\"}, \\\\\\\"column_meta_116\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"MD_EARN_WNE_P10\\\\\\\"}, \\\\\\\"column_meta_117\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"GT_25K_P6\\\\\\\"}, \\\\\\\"column_meta_118\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"GT_28K_P6\\\\\\\"}, \\\\\\\"column_meta_119\\\\\\\": 
{\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"GRAD_DEBT_MDN_SUPP\\\\\\\"}, \\\\\\\"column_meta_120\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"GRAD_DEBT_MDN10YR_SUPP\\\\\\\"}, \\\\\\\"column_meta_121\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"RPY_3YR_RT_SUPP\\\\\\\"}, \\\\\\\"column_meta_122\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"C150_L4_POOLED_SUPP\\\\\\\"}, \\\\\\\"column_meta_123\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Float\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"C150_4_POOLED_SUPP\\\\\\\"}, \\\\\\\"column_meta_124\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Text\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"UNITID_wikidata\\\\\\\"}, \\\\\\\"column_meta_125\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Text\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"OPEID6_wikidata\\\\\\\"}, \\\\\\\"column_meta_126\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Text\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"STABBR_wikidata\\\\\\\"}, \\\\\\\"column_meta_127\\\\\\\": {\\\\\\\"semantic_type\\\\\\\": [\\\\\\\"http://schema.org/Text\\\\\\\", \\\\\\\"https://metadata.datadrivendiscovery.org/types/Attribute\\\\\\\"], \\\\\\\"name\\\\\\\": \\\\\\\"CITY_wikidata\\\\\\\"}, \\\\\\\"data_metadata\\\\\\\": {\\\\\\\"shape_0\\\\\\\": 7175, \\\\\\\"shape_1\\\\\\\": 128}, \\\\\\\"first_10_rows\\\\\\\": \\\\\\\",UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,INSTURL,NPCURL,HCM2,PREDDEG,HIGHDEG,CONTROL,LOCALE,HBCU,PBI,ANNHI,TRIBAL,AANAPII,HSI,NANTI,MENONLY,WOMENONLY,RELAFFIL,SATVR25,SATVR75,SATMT25,SATMT75,SATWR25,SATWR75,SATVRMID,SATMTMID,SATWRMID,ACTCM25,ACTCM75,ACTEN25,ACTEN75,ACTMT25,ACTMT75,ACTWR25,ACTWR75,ACTCMMID,ACTENMID,ACTMTMID,ACTWRMID,SAT_AVG,SAT_AVG_ALL,PCIP01,PCIP03,PCIP04,PCIP05,PCIP09,PCIP10,PCIP11,PCIP12,PCIP13,PCIP14,PCIP15,PCIP16,PCIP19,PCIP22,PCIP23,PCIP24,PCIP25,PCIP26,PCIP27,PCIP29,PCIP30,PCIP31,PCIP38,PCIP39,PCIP40,PCIP41,PCIP42,PCIP43,PCIP44,PCIP45,PCIP46,PCIP47,PCIP48,PCIP49,PCIP50,PCIP51,PCIP52,PCIP54,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,NPT4_PUB,NPT4_PRIV,NPT41_PUB,NPT42_PUB,NPT43_PUB,NPT44_PUB,NPT45_PUB,NPT41_PRIV,NPT42_PRIV,NPT43_PRIV,NPT44_PRIV,NPT45_PRIV,PCTPELL,RET_FT4_POOLED_SUPP,RET_FTL4_POOLED_SUPP,RET_PT4_POOLED_SUPP,RET_PTL4_POOLED_SUPP,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GT_25K_P6,GT_28K_P6,GRAD_DEBT_MDN_SUPP,GRAD_DEBT_MDN10YR_SUPP,RPY_3YR_RT_SUPP,C150_L4_POOLED_SUPP,C150_4_POOLED_SUPP,UNITID_wikidata,OPEID6_wikidata,STABBR_wikidata,CITY_wikidata\\\\\\\\n0,100654,100200,1002,Alabama A & M 
University,Normal,AL,www.aamu.edu/,www2.aamu.edu/scripts/netpricecalc/npcalc.htm,0,3,4,1,12.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,380.0,470.0,370.0,470.0,370.0,457.0,425.0,420.0,414.0,16.0,19.0,14.0,20.0,15.0,18.0,,,18.0,17.0,17.0,,849.0,849.0,0.0448,0.0142,0.0071,0.0,0.0,0.0354,0.0401,0.0,0.1132,0.0896,0.0472,0.0,0.033,0.0,0.0094,0.066,0.0,0.0708,0.0024,0.0,0.0,0.0,0.0,0.0,0.0307,0.0,0.0472,0.0519,0.0377,0.0448,0.0,0.0,0.0,0.0,0.0283,0.0,0.1863,0.0,0.0,4616.0,0.0256,0.9129,0.0076,0.0019,0.0024,0.0017,0.0401,0.0065,0.0013,0.0877,1,15567.0,,15043.0,15491.0,17335.0,19562.0,18865.0,,,,,,0.7039,0.5774,,0.309,,0.7667,0.0859,31000,0.453,0.431,32750,348.16551225731,0.2531554273,,0.2913,Q39624632,Q17203888,Q173,Q575407\\\\\\\\n1,100663,105200,1052,University of Alabama at Birmingham,Birmingham,AL,www.uab.edu,uab.studentaidcalculator.com/survey.aspx,0,3,4,1,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,480.0,640.0,490.0,660.0,,,560.0,575.0,,21.0,28.0,22.0,30.0,19.0,26.0,,,25.0,26.0,23.0,,1125.0,1125.0,0.0,0.0,0.0,0.0005,0.036000000000000004,0.0,0.0131,0.0,0.0748,0.0599,0.0,0.0059,0.0,0.0,0.0158,0.0135,0.0,0.0734,0.009000000000000001,0.0,0.0,0.0,0.005,0.0,0.0212,0.0,0.0766,0.0243,0.0221,0.0365,0.0,0.0,0.0,0.0,0.0392,0.25,0.2072,0.0162,0.0,12047.0,0.5786,0.2626,0.0309,0.0598,0.0028,0.0004,0.0387,0.0179,0.0083,0.2578,1,16475.0,,13849.0,15385.0,18022.0,18705.0,19319.0,,,,,,0.3525,0.8007,,0.5178,,0.5179,0.2363,41200,0.669,0.631,21833,232.106797835537,0.513963161,,0.5384,Q39624677,Q17204336,Q173,Q79867\\\\\\\\n2,100690,2503400,25034,Amridge University,Montgomery,AL,www.amridgeuniversity.edu,www2.amridgeuniversity.edu:9091/,0,3,4,2,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,74.0,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0889,0.0,0.0,0.0889,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4,0.0,0.0,0.0,0.0667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3556,0.0,1.0,293.0,0.157,0.2355,0.0068,0.0,0.0,0.0034,0.0,0.0,0.5973,0.5392,1,,10155.0,,,,,,10155.0,,,,,0.6971,PrivacySuppressed,,PrivacySuppressed,,0.8436,0.8571,39600,0.658,0.542,22890,243.343773299842,0.2307692308,,PrivacySuppressed,Q39624831,Q17337864,Q173,Q29364\\\\\\\\n3,100706,105500,1055,University of Alabama in Huntsville,Huntsville,AL,www.uah.edu,finaid.uah.edu/,0,3,4,1,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,520.0,660.0,540.0,680.0,,,590.0,610.0,,25.0,31.0,24.0,33.0,23.0,29.0,,,28.0,29.0,26.0,,1257.0,1257.0,0.0,0.0,0.0,0.0,0.0301,0.0,0.0499,0.0,0.0282,0.2702,0.0,0.0151,0.0,0.0,0.0122,0.0,0.0,0.0603,0.0132,0.0,0.0,0.0,0.0113,0.0,0.0226,0.0,0.016,0.0,0.0,0.0188,0.0,0.0,0.0,0.0,0.0264,0.1911,0.225,0.0094,0.0,6346.0,0.7148,0.1131,0.0411,0.0414,0.012,0.0,0.0181,0.0303,0.0292,0.1746,1,19423.0,,15971.0,18016.0,20300.0,21834.0,22059.0,,,,,,0.2949,0.8161,,0.5116,,0.4312,0.2255,46700,0.685,0.649,22647,240.760438353933,0.5485090298,,0.4905,Q39624901,Q17204354,Q173,Q79860\\\\\\\\n4,100724,100500,1005,Alabama State 
University,Montgomery,AL,www.alasu.edu,www.alasu.edu/cost-aid/forms/calculator/index.aspx,0,3,4,1,12.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,370.0,460.0,360.0,460.0,,,415.0,410.0,,15.0,19.0,14.0,19.0,15.0,17.0,,,17.0,17.0,16.0,,825.0,825.0,0.0,0.0,0.0,0.0,0.1023,0.0,0.0503,0.0,0.1364,0.0,0.0,0.0,0.0,0.0,0.0114,0.0,0.0,0.0779,0.0146,0.0,0.0,0.0211,0.0,0.0,0.0244,0.0,0.0503,0.1412,0.0633,0.013000000000000001,0.0,0.0,0.0,0.0,0.0487,0.1429,0.0974,0.0049,0.0,4704.0,0.0138,0.9337,0.0111,0.0028,0.0013,0.0004,0.0111,0.0159,0.01,0.0727,1,15037.0,,14111.0,15140.0,17492.0,19079.0,18902.0,,,,,,0.7815,0.6138,,0.5313,,0.8113,0.0974,27700,0.393,0.351,31500,334.876752247489,0.2185867473,,0.2475,Q39624974,Q17203904,Q173,Q29364\\\\\\\\n5,100751,105100,1051,The University of Alabama,Tuscaloosa,AL,www.ua.edu/,financialaid.ua.edu/net-price-calculator/,0,3,4,1,13.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,490.0,610.0,490.0,620.0,480.0,600.0,550.0,555.0,540.0,23.0,31.0,23.0,33.0,22.0,29.0,7.0,8.0,27.0,28.0,26.0,8.0,1202.0,1202.0,0.0,0.0039,0.0,0.0042,0.102,0.0,0.0098,0.0,0.0782,0.1036,0.0,0.0057,0.0692,0.0,0.0115,0.0,0.0,0.0338,0.009000000000000001,0.0,0.0206,0.0,0.0031,0.0,0.0115,0.0,0.036000000000000004,0.0263,0.0109,0.0362,0.0,0.0,0.0,0.0,0.026000000000000002,0.0988,0.2879,0.0118,0.0,31663.0,0.7841,0.1037,0.0437,0.0118,0.0036,0.0009,0.0297,0.0192,0.0033,0.0819,1,21676.0,,18686.0,20013.0,22425.0,23666.0,24578.0,,,,,,0.1938,0.8637,,0.4308,,0.4007,0.081,44500,0.695,0.679,23290,247.596176502985,0.6019442985,,0.6793,Q39625107,Q17204328,Q173,Q79580\\\\\\\\n6,100760,100700,1007,Central Alabama Community College,Alexander City,AL,www.cacc.edu,www.cacc.edu/NetPriceCalculator/14-15/npcalc.html,0,2,2,1,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0266,0.0082,0.0,0.0,0.1025,0.0,0.0,0.0,0.0,0.2787,0.0,0.0,0.0,0.0,0.0287,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0307,0.3176,0.0,0.0,0.1209,0.0861,0.0,0.0,1492.0,0.6877,0.2802,0.0127,0.002,0.004,0.0007,0.0067,0.002,0.004,0.3733,1,9128.0,,8882.0,8647.0,11681.0,11947.0,13868.0,,,,,,0.5109,,0.5666,,0.4554,0.3234,0.263,27700,0.466,0.395,9500,100.994576074639,0.2510056315,0.2136,,Q39625150,Q17203916,Q173,Q79663\\\\\\\\n7,100812,100800,1008,Athens State University,Athens,AL,www.athens.edu,https://24.athens.edu/apex/prod8/f?p=174:1:3941357449598491,0,3,3,1,31.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0462,0.0,0.2192,0.0,0.0,0.0,0.0,0.0,0.0346,0.0538,0.0,0.0231,0.0205,0.0,0.0154,0.0154,0.0038,0.0,0.0026,0.0,0.0308,0.0282,0.0,0.0218,0.0,0.0,0.0,0.0,0.0256,0.0064,0.4449,0.0077,0.0,2888.0,0.7784,0.125,0.0215,0.0076,0.0142,0.001,0.0187,0.001,0.0325,0.5817,1,,,,,,,,,,,,,0.4219,,,,,0.6455,0.6774,38700,0.653,0.612,18000,191.358144141422,0.5038167939,,,Q39625389,Q17203920,Q173,Q203263\\\\\\\\n8,100830,831000,8310,Auburn University at 
Montgomery,Montgomery,AL,www.aum.edu,www.aum.edu/current-students/financial-information/net-price-calculator,0,3,4,1,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,435.0,495.0,445.0,495.0,,,465.0,470.0,,19.0,24.0,19.0,24.0,17.0,22.0,,,22.0,22.0,20.0,,1009.0,1009.0,0.0,0.02,0.0,0.0,0.0601,0.0,0.0,0.0,0.0584,0.0,0.0,0.0033,0.0,0.0,0.0067,0.0117,0.0,0.0534,0.0083,0.0,0.0,0.0501,0.0,0.0,0.015,0.0,0.0668,0.0351,0.0,0.0401,0.0,0.0,0.0,0.0,0.0267,0.2621,0.2705,0.0117,0.0,4171.0,0.5126,0.3627,0.0141,0.0247,0.006,0.001,0.0319,0.0412,0.0058,0.2592,1,15053.0,,13480.0,14114.0,16829.0,17950.0,17022.0,,,,,,0.4405,0.6566,,0.4766,,0.5565,0.2257,33300,0.616,0.546,23363,248.372240087558,0.4418886199,,0.2207,Q39625474,Q17613566,Q173,Q29364\\\\\\\\n9,100858,100900,1009,Auburn University,Auburn,AL,www.auburn.edu,https://www.auburn.edu/admissions/netpricecalc/freshman.html,0,3,4,1,13.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,530.0,620.0,530.0,640.0,520.0,620.0,575.0,585.0,570.0,24.0,30.0,25.0,32.0,23.0,28.0,7.0,8.0,27.0,29.0,26.0,8.0,1217.0,1217.0,0.0437,0.0133,0.0226,0.0,0.0575,0.0,0.0079,0.0,0.0941,0.1873,0.0,0.0097,0.0337,0.0,0.0088,0.0,0.0,0.0724,0.0097,0.0,0.0267,0.0,0.0014,0.0,0.0093,0.0,0.033,0.0,0.0179,0.0312,0.0,0.0,0.0,0.0,0.0326,0.0667,0.2113,0.009000000000000001,0.0,22095.0,0.8285,0.0673,0.0335,0.0252,0.0052,0.0003,0.0128,0.0214,0.0059,0.0831,1,21984.0,,15591.0,19655.0,23286.0,24591.0,25402.0,,,,,,0.1532,0.9043,,0.7229,,0.32799999999999996,0.0427,48800,0.741,0.726,21500,228.566672168921,0.7239612977,,0.74,Q39625609,Q17203926,Q173,Q225519\\\\\\\\n10,100937,101200,1012,Birmingham Southern College,Birmingham,AL,www.bsc.edu/,www.bsc.edu/fp/np-calculator.cfm,0,3,3,2,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,71.0,500.0,610.0,490.0,570.0,,,555.0,530.0,,23.0,28.0,22.0,29.0,22.0,26.0,,,26.0,26.0,24.0,,1150.0,1150.0,0.0,0.023,0.0,0.0077,0.0268,0.0,0.0,0.0,0.046,0.0077,0.0,0.0077,0.0,0.0,0.023,0.0,0.0,0.1379,0.0498,0.0,0.0383,0.0,0.0307,0.0,0.0575,0.0,0.0728,0.0,0.0,0.0881,0.0,0.0,0.0,0.0,0.1034,0.0,0.2261,0.0536,0.0,1289.0,0.7921,0.1171,0.0217,0.0489,0.006999999999999999,0.0,0.0109,0.0,0.0023,0.0054,1,,23227.0,,,,,,20815.0,19582.0,23126.0,24161.0,25729.0,0.1888,0.8386,,,,0.4729,0.0141,46700,0.637,0.618,26045,276.88460356463,0.7559912854,,0.6439,,Q17203945,Q173,Q79867\\\\\\\\n\\\\\\\", \\\\\\\"local_storage\\\\\\\": \\\\\\\"/data00/dsbox/datamart/memcache_storage/datasets_cache/794d5f7dcddae86817a10e16ee1aecfa.h5\\\\\\\"}\\\"}, \\\"title\\\": {\\\"xml:lang\\\": \\\"en\\\", \\\"type\\\": \\\"literal\\\", \\\"value\\\": \\\"most recent cohorts scorecard elements csv\\\"}, \\\"keywords\\\": {\\\"datatype\\\": \\\"http://www.w3.org/2001/XMLSchema#string\\\", \\\"type\\\": \\\"literal\\\", \\\"value\\\": \\\"unitid opeid opeid6 instnm city stabbr insturl npcurl hcm2 preddeg highdeg control locale hbcu pbi annhi tribal aanapii hsi nanti menonly womenonly relaffil satvr25 satvr75 satmt25 satmt75 satwr25 satwr75 satvrmid satmtmid satwrmid actcm25 actcm75 acten25 acten75 actmt25 actmt75 actwr25 actwr75 actcmmid actenmid actmtmid actwrmid sat avg sat avg all pcip01 pcip03 pcip04 pcip05 pcip09 pcip10 pcip11 pcip12 pcip13 pcip14 pcip15 pcip16 pcip19 pcip22 pcip23 pcip24 pcip25 pcip26 pcip27 pcip29 pcip30 pcip31 pcip38 pcip39 pcip40 pcip41 pcip42 pcip43 pcip44 pcip45 pcip46 pcip47 pcip48 pcip49 pcip50 pcip51 pcip52 pcip54 distanceonly ugds ugds white ugds black ugds hisp ugds asian ugds aian ugds nhpi ugds 2mor ugds nra ugds unkn pptug ef curroper npt4 pub npt4 priv npt41 pub npt42 pub npt43 pub npt44 pub npt45 pub npt41 priv npt42 
priv npt43 priv npt44 priv npt45 priv pctpell ret ft4 pooled supp ret ftl4 pooled supp ret pt4 pooled supp ret ptl4 pooled supp pctfloan ug25abv md earn wne p10 gt 25k p6 gt 28k p6 grad debt mdn supp grad debt mdn10yr supp rpy 3yr rt supp c150 l4 pooled supp c150 4 pooled supp unitid wikidata opeid6 wikidata stabbr wikidata city wikidata\\\"}, \\\"datasetLabel\\\": {\\\"xml:lang\\\": \\\"en\\\", \\\"type\\\": \\\"literal\\\", \\\"value\\\": \\\"D4cb70062-77ed-4097-a486-0b43ffe81463\\\"}, \\\"variableName\\\": {\\\"datatype\\\": \\\"http://www.w3.org/2001/XMLSchema#string\\\", \\\"type\\\": \\\"literal\\\", \\\"value\\\": \\\"INSTNM\\\"}, \\\"score\\\": {\\\"datatype\\\": \\\"http://www.w3.org/2001/XMLSchema#double\\\", \\\"type\\\": \\\"literal\\\", \\\"value\\\": \\\"0.9398390424662831\\\"}}, \\\"query_json\\\": {\\\"keywords\\\": [\\\"INSTNM\\\"], \\\"variables\\\": {\\\"INSTNM\\\": \\\"medanos funeral liberty costa calhoun management gardens spartanburg alderson fashion montrose platteville point southern polytechnic mayo wiregrass morris waters cajon cosemtology bunker amboy dubois kapiolani salon toni suny buncombe forsyth mcminnville rose hickey gwinnett cities lewiston greene broadview birmingham mountains michigan altamonte charlotte dental reedley defiance paier mcmurry campus georgetown marlboro springfield missouri josef ingram woodbury goldey stamford schreiner finger bottineau essex linn eastwick wharton furman federal caribbean cogswell art blauvelt cem lee aiken richmond soledad lloyd mohave adult moore yavapai viterbo ipswich scripps exton salt metro dlp mccann school benedict bronx jones miracosta inland mexico workforce faribault cross f midway jose henrietta oriental cerritos bowling langhorne clearwater vega desert pioneer upland wofford orange public porter longwood cod elegance sinai columbiana snyder tarrant ludlow grenada metropark cookman kenneth foley daymar helena concordia broadcasting technicians bennet lower albright community anza wooster rogue studies general range red eves luna allied skyline drury bayshore training linda mgh enterprise cooper mokena ashtabula fletcher chemung irving connecticut clovis merritt lucie harold phagans windsor gloucester carroll hairdressing merchandising massage radford canada collins divinity crescent medford technological henrico garrett slippery online automotive aguadilla coyne collegiate roxborough chipola intercoast extended interactive seattle penn victor dutchess highlands coffeyville lynn valley stratton fortis cda wade arapahoe little coalinga panola duluth mentor wolff vanguard house elmira sunstate estes del rensselaer camp santa tioga ann mackie rush noc granville indianhead frostburg steuben watkins northwestern judson sage roane cordon redwoods parkersburg boise dramatic richland brightwood rivertown pierpont roseville hurst philander stone allegheny alto kaskaskia c nutley schilling hays glen john miramar cookeville phillips du brigham marian platt owen coast guy machias crestwood applied faith syracuse rider sylvania remington andrews studio 5 emporia rice belhaven portfolio citadel dabney bridgeport model conemaugh germanna graceland chamblee holyoke scott apex jenny pro site tech centro ottumwa macalester loma learning camden buffalo heidelberg brookdale mohawk diego erskine dawson chicopee modesto cobleskill bethel belleville ct universidad mines lansing wilson faulkner highland broken devry evers barber fayetteville bay mount ferris macon baltimore consolidated somersworth blackwood lincoln anna berea 
claire branch caribou methodist pierce shelby northwest cumberlands tribal merrell liu valencia kansas age florham juilliard hillsboro aveda cameo lasell pineville jamaica gastonia taos lindenwood england carolina hamilton name brookline ailano dalton newark chaminade pennco towson davison piedmont baylor douglasville clark prince asbury payne augustana accelerated hands corazon juniata eberly lakeland itasca alamogordo ouachita grants elizabeth cuisine professions environmental capitol westminster danville riverside johnson titusville hardin covenant bradley caldwell leeward hazleton seminary cabarrus douglas averett chattanooga mediatech sauk luke haute vermont baptist denmark lac cowley touro allen centers cottey lyndhurst ivy westbury chadron strayer bossier levittown morrison harrold kd shawnee laguardia passaic ave gainesville virgin healthcare simon jolie briar cumberland pharr treme shore wiley woodland baja hobart mcallen county s main arts luther brevard stockton technology herzing hilliard architecture engineering edic grossmont advanced scottsdale coconino granite forestry quinnipiac le noblesville bon conservatory junction barclay big ripon bowie lenoir halifax mary walden christian navarro salus denison iti computer street montgomery hancock motivation northeast carlow skysong hallmark wabash salkehatchie compass liverpool expressions silicon theology keuka joseph ft lima designory altoona zanesville teterboro wood southmost lansdale muncie by ash integral vet arecibo tennessee unity b waterbury victoria west rockland central butler lakes chambersburg henry rasmussen visual rhodes alternative azusa hall rapid avalon ocean hazard pembroke gardena harbor greensburg grand moravian stevenson golfers shuler keiser pearl industrial abraham sheen y ouachitas woburn liceo clear turnersville virginia non carsten beaumont esani governors dewey hannibal for montcalm franciscan m berk binghamton eunice dixie carbondale clair fisher pharmacy beckfield kentwood embry stautzenberger sand concord alfred barranquitas hobby rogers elon paradise continental owens vista fidm antelope educators keys chester prescott lane styling tacoma universal manhattan wittenberg concorde ibmc dc marion johnston albertus peak presbyterian dixon beaufort bryan midwest samuel assistants waterfront clarkson temple hospitality locations moline avance blessing maharishi lufkin maine tufts smith doheny voorhees montevallo lawton politecnica cutler and sarah arthur beach holistic bard burnsville brockport national specs bancroft collin paramus arlington rochelle fort wichita laurel takoma colgate rieman trailer sentara fargo welder fashions purchase pulaski la marshalltown sail fajardo amityville ulster chesterfield naropa brewton cosumnes quincy cedar hypnosis kennesaw purdue rosa van wic neosho piscataway hawaii duke mission air mar o jacinto intermediate tribes feather yuba ford gene roswell cheyenne colorado palladium machinists hondros depauw belmont strand agnes morrisville nashville way coachella friendswood berry jay fear wesleyan midlands susquehanna avery poplar mayaguez lehigh indiana rocks kent campuses humboldt vermilion denver legal medgar woman lilburn prairie upper kensington osteopathic cancer brownsville shepherd myotherapy wilkes sargeant nebraska caguas concept bene forks broomall middlesex king ridley mendota wise branford weill queens covina holy bucks ideal berkeley northeastern folsom medrash lutheran shear tractor behrend ross area the wayne services mobile robeson lowcountry northcoast joya 
of summit merrimack a govoha edp luis peter plymouth webber wentworth juan howard continuing aeronautical rainy chandler allentown otero falls associated truett heavilin downtown wachusett arundel niagara blades trumbull omnitech aviation louis bluff georgian portage brick unit living north brittany institutes stafford soma houghton california findlay stonehill anthony automeca colton brickell minnesota visible bakersfield kirtland ces biloxi chapman merrillville mattydale p escanaba forest anselm beaver castle intercontinental tri state schoolcraft park cci bergen rust montreat brecksville ozarks hillsborough angelo sam brooklyn olaf salem sandusky huertas sebastian fredric lourdes trinidad hudson jacksonville 6 cazenovia culinaire dow girardeau providence mason wing moberly trend clarendon knoxville perth brown taylor westboro rio whitewater marietta newport antioch ames wheelock culver fine farmingdale case sparks rochester mesa motlow oconee rock salish peirce ambler louisiana romeoville brentwood honolulu sacramento southwest richfield altierus albert everest dona ecole marywood pikeville gwynedd humacao toms hollywood ort cruces plainfield cincinnati independence skidmore hesperia iowa mountwest layton rancho more cloud sw yti pressley federico liberal cornell gardner spalding barbering region global coleman yale irvine musical mississippi united healing ursinus eureka mj success carmel somerset monterey presentation fiance mercyhurst bj durham ultimate whittier florence cuny plaza muskingum johns philadelphia linwood calvin davis springs colby olivet lakeview medical master pikes visalia washington dubuque richey claremont moorhead saint system oneida land asheville pennsylvania staten cbd charzanne corona don l montana ego whatcom billings benedictine christi lancaster oral resurrection mitchell md corporation idaho trade reserve alaska oxford vocations renton adrian court clayton everett mesabi mining hillyard camelot g arkansas sussex appling bloomfield greece worcester willoughby scioto jordan jewell cosmetology clarksburg calvary albizu alvin trinity chemeketa bible chesapeake grambling harrisburg tarleton ogle dartmouth murphy shorter exposito professionals simpson dci xenon powersport therapy huntsville nursing kankakee skin northern vaughn morningside miller windward e kootenai umpqua enfield lafayette mifflin gulf heights mont research ravenna hurstborne college juarez auburn eastfield texas marin nails pitt montclair mankato robert harris pasco lorain massachusetts palm charlie institute verne sum iona arbor manassas valparaiso siloam mcpherson drive barbara nuys papageorge anaheim law diesel norwich lone savannah paramount cameron pitzer southtowns wisconsin cape spartan brandman delmarva augsburg campbellsville rey lowell eugene kokomo spencerian barton kaye parker canyons ad marshall perimeter d mcneese bellevue ranger hampton sweet nunez landmark riddle bradford fond brandon university 2017 winthrop pomeroy bramson hilo ventura haverford willamette natchez hair sheffield natural este milwaukee darlington fe bela troy harrison gavilan hardeman fairleigh frontier mcconnell multnomah 1 dominguez drafting margaret gupton charles asm norfolk wake margate woodbridge wheeling mssu utah practical normandale american hollins nashua puerto bordentown thomas kalamazoo cardinal melbourne charleston sprunt hawk parkside itawamba clara cobb inc south district ayers chicago oswego parishes scholars obispo mills israel madison chamberlain bluegrass j dayton green great bridgevalley 
james life molloy rouge pellissippi hammond brownsburg olympian steven helene everglades kennedy aeronautics berklee welding port technical biomedical manatee huron german forrest regional dakota brighton wellness seminole hofstra sae saddleback hato watsonville centura nevada roman morrow mansfield capital harvard restaurant schools waynesburg murray beal lawrenceville pensacola calumet medicine luzerne albuquerque thiel grayson whitman kirkwood media carlos boston america creighton kettering de clary division capri bismarck niles hernando stephens klamath education northland ithaca dodge queensborough muhlenberg mckendree webb sarasota eti hulman orlando joliet union texarkana livingstone hagerstown weatherford hackensack southeastern hunter wausau joint sacred pontifical edu staples euphoria schaumburg perry ex greenspoint recording midland quinta parma mountain motte abington mullins sunset las buena word hopkins shenango carrollton okc catholic eagan academy uei puget youngstown baker traverse francis edinboro schuyler quest catherine snow omaha redlands professional harlingen london capstone gulfport vernon mycomputercareer southeast carnegie ata donnelly focus tyler manor winston images warminster emmanuel southfield mercy bellus brite design berkshire old forty sues brunswick accountancy graduate webster quinn hilbert woonsocket ponce scranton cortiva lynwood tinley drew hamrick white rudae dordt talladega jessup paris carrington at hanceville ai david paul catawba tabor bloomington danbury northpoint andrew rosalind gonzaga creative william health washtenaw tallahassee hospital westchester farmington los owings hood rapids richardson akron gateway centenary coker sierra rizzieri dunwoody junior clinton manchester laredo hope univeristy tampa sinclair edward alvernia jefferson modern ntma rolla murfreesboro carson steubenville beacom anderson lea southwestern depaul ferrum ruidoso dean guayama centra arte bethune barry esthetics loudoun essential amherst tiffin ravenscroft ontario music meadows lesley coba emerson geauga blaine walsh batesville dickinson eastern cochran francisco clinic atep loyola 10 nw tusculum lexington hartford radiation vanderbilt slidell warren opportunities laboure company evergreen hills advancing huntington burlington pepperdine circus ucas ancilla esthiology albany five nhti pasadena nassau horry academic ridgewater memphis cordova argosy golf wellspring emory lagrange truckee hawkeye maritime potomac woods caribe sandy clifton killeen collegeamerica superior seacoast evansville ashland morrilton equestrian hennepin maryville alamos linfield brazosport norbert landover fox westport corning orleans haven austin pine location delaware brandeis darby coe six middletown fredonia hamline rhode clinical star welch apollo trine bryant marcos walnut eagle vance flint nazarene boylston boricua christopher gettysburg edison marine dupage clarksville oregon sanford roche jarvis siena q valdosta oehrlein alabama wilmington walla martinsburg appleton stroudsburg houma bob swarthmore motoring joaquin angeles galveston richard make moines oak cortland laboratory malden whitworth neumann meredith 4 houston world lauderhill zane kutztown el on jersey hampshire roanoke terre moyne jacobs poway in mars ridge merced clackamas sawyer peay dividend citrus cattaraugus stevens military atlanta george agricultural louisville wallace salina therapeutic anoka adolphus parisian prism mesquite lithonia fairbanks trades lauderdale excellence katy salle sullivan gill cambridge aurora 
galen dallas gregory carlsbad northcentral newberry crossroads parish abilene snead lebanon marymount orlo brownson georgia support franklin chapel tricoci islands llc vincent directions humphreys lpn maranatha vogue careers academie architectural century tulsa spa hanover lander kentucky inter kauai indianapolis belle highline simmons oklahoma alliance clover middlebury allan commonwealth skagit nicholls pj crouse river grove carthage northwood stowe irene flathead hesston monroe up wells assumption scholastica hannah lamson san pc oakwood thaddeus clatsop millikin dimondale bucknell beaverton alameda tunxis daytona med fair nyack commerce bissonnet fresno coastline dominican newington brookfield stanford bland culinary pillar tempe tecnologia jamestown cornish smiths headlines florida cuyahoga neumont rocky sound dorothy mahwah fall yeshiva sioux cedarville ringling names orion soka benjamin pueblo ambrose eau bothell brookhaven kilgore mott brothers twin suffolk solano cascades bluffton roger madisonville rowan otterbein aquinas setters cecil science berks rutgers shreveport fuld eastland augusta dover innsbrook se heath sumter westfield broaddus states city mineral summerlin towns heathrow digital weslaco hodges hill bodywork beauty elaine elkhart baymeadows merchant conway taunton bayamon st zona tompkins pedro wales spokane leandro bethany plains farms pinnacle pines worth jameson tualatin walters turlock vegas antonio glenville corpus beau industrialization maryland mr pleasant denton hairstyling marygrove centre salisbury assist young portsmouth beth baton ursuline blue portland phoenix culture sterling vatterott wright barre greystone goodwin news lawrence manati delta boces chula chattahoochee career wytheville kenyon colleges freed pomona campbell muskegon vocational empire chillicothe maccormac lukes glendale stanbridge island missoula oaks monroeville adventist carbon toledo veeb ana bend high elkins nicolet pratt fredericksburg morgantown polaris elley theological hialeah ranken cet lewisville amarillo fitchburg greensboro mildred agriculture arizona corban westech dynamics saginaw anne mendocino montserrat mid international davenport panhandle des occidental wheaton minneapolis family sagrado artistic villanova pottsville indian shoals tidewater miat maintenance elizabethtown acupuncture kittanning long warrendale columbia clarke laconia anschutz charter maria columbus cairn limestone stritch wyoming alexandria eastlake grabber mt mckenna decatur henager networks pass mcnally southgate thornton iselin intellitec arrow gerbers raleigh european greenville michael mti monde lewis allegany center brainerd rhyne philip administration magnus capella pci treasury choffin triangle carey whitestone rockford stratford grace heart western traditional middle closed bonaventure program mclean myomassology bennington williston gate basin illinois abbey bloomsburg cogliano moscow cliff ramsey cherry chestnut programs williams warner southington petersburg ottawa roosevelt seguin canton moraine dearborn sonoma fayette bristol tucson ecpi metropolitan kendall cheeks hastings east cleveland genesis crosse copiah granger care milan madonna dorsey film lemoore lake rockhurst stetson logan trocaire dominion midstate cannella town pittsburgh schenectady gordon secours manhattanville unitech grande herkimer black homestead fremont business fulton jasper monticello edgecombe lubbock shelton gallaudet grays wor ogeechee gustavus strongsville britain princeton martin spelman station nelly johnstown 
testing all york view coastal bleu geneseo barrett advancement casper reporting aims lehman albion waubonsee line degree xavier golden flagstaff licking upstate atlantic antonelli memorial mchenry monmouth magnolia divers davidson boulder rob lassen langston trevecca athens holmes jackson women metairie doane goshen rollins pacific full swlr fairfield erie benton stark ohio ivc bartending onondaga alice deaf chenoweth spring greater baldwin riverhead alliant salter reynolds cozmo radiologic paso kaplan bangor miles leon jesuit roberts stephen ball sciences metropolitana jfk bastyr beltsville italy rico miami earlham reno new fairmont roy quinsigamond myrtle paltz ne ogden tuskegee keystone hibbing service wesley pima sewanee blinn lamar monica\\\"}, \\\"keywords_search\\\": [\\\"college\\\", \\\"scorecard\\\", \\\"finance\\\", \\\"debt\\\", \\\"earnings\\\"], \\\"variables_search\\\": {}}, \\\"search_type\\\": \\\"general\\\"}, \\\"augmentation\\\": {\\\"properties\\\": \\\"join\\\", \\\"right_columns\\\": [3], \\\"left_columns\\\": [2]}, \\\"datamart_type\\\": \\\"isi\\\"}\", \"metadata\": [{\"metadata\": {\"dimension\": {\"length\": 7175, \"name\": \"rows\", \"semantic_types\": [\"https://metadata.datadrivendiscovery.org/types/TabularRow\"]}, \"schema\": \"https://metadata.datadrivendiscovery.org/schemas/v0/container.json\", \"semantic_types\": [\"https://metadata.datadrivendiscovery.org/types/Table\"], \"structural_type\": \"d3m.container.pandas.DataFrame\"}, \"selector\": []}, {\"metadata\": {\"dimension\": {\"length\": 128, \"name\": \"columns\", \"semantic_types\": [\"https://metadata.datadrivendiscovery.org/types/TabularColumn\"]}}, \"selector\": [\"__ALL_ELEMENTS__\"]}, {\"metadata\": {\"name\": \"UNITID\", \"semantic_types\": [\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 0]}, {\"metadata\": {\"name\": \"OPEID\", \"semantic_types\": [\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 1]}, {\"metadata\": {\"name\": \"OPEID6\", \"semantic_types\": [\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 2]}, {\"metadata\": {\"name\": \"INSTNM\", \"semantic_types\": [\"http://schema.org/Text\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 3]}, {\"metadata\": {\"name\": \"CITY\", \"semantic_types\": [\"http://schema.org/Text\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 4]}, {\"metadata\": {\"name\": \"STABBR\", \"semantic_types\": [\"http://schema.org/Text\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 5]}, {\"metadata\": {\"name\": \"INSTURL\", \"semantic_types\": [\"http://schema.org/Text\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 6]}, {\"metadata\": {\"name\": \"NPCURL\", \"semantic_types\": [\"http://schema.org/Text\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 7]}, {\"metadata\": {\"name\": \"HCM2\", \"semantic_types\": 
[\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 8]}, {\"metadata\": {\"name\": \"PREDDEG\", \"semantic_types\": [\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 9]}, {\"metadata\": {\"name\": \"HIGHDEG\", \"semantic_types\": [\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 10]}, {\"metadata\": {\"name\": \"CONTROL\", \"semantic_types\": [\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 11]}, {\"metadata\": {\"name\": \"LOCALE\", \"semantic_types\": [\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 12]}, {\"metadata\": {\"name\": \"HBCU\", \"semantic_types\": [\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 13]}, {\"metadata\": {\"name\": \"PBI\", \"semantic_types\": [\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 14]}, {\"metadata\": {\"name\": \"ANNHI\", \"semantic_types\": [\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 15]}, {\"metadata\": {\"name\": \"TRIBAL\", \"semantic_types\": [\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 16]}, {\"metadata\": {\"name\": \"AANAPII\", \"semantic_types\": [\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 17]}, {\"metadata\": {\"name\": \"HSI\", \"semantic_types\": [\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 18]}, {\"metadata\": {\"name\": \"NANTI\", \"semantic_types\": [\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 19]}, {\"metadata\": {\"name\": \"MENONLY\", \"semantic_types\": [\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 20]}, {\"metadata\": {\"name\": \"WOMENONLY\", \"semantic_types\": [\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 21]}, {\"metadata\": {\"name\": \"RELAFFIL\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 22]}, {\"metadata\": {\"name\": \"SATVR25\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": 
[\"__ALL_ELEMENTS__\", 23]}, {\"metadata\": {\"name\": \"SATVR75\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 24]}, {\"metadata\": {\"name\": \"SATMT25\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 25]}, {\"metadata\": {\"name\": \"SATMT75\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 26]}, {\"metadata\": {\"name\": \"SATWR25\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 27]}, {\"metadata\": {\"name\": \"SATWR75\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 28]}, {\"metadata\": {\"name\": \"SATVRMID\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 29]}, {\"metadata\": {\"name\": \"SATMTMID\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 30]}, {\"metadata\": {\"name\": \"SATWRMID\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 31]}, {\"metadata\": {\"name\": \"ACTCM25\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 32]}, {\"metadata\": {\"name\": \"ACTCM75\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 33]}, {\"metadata\": {\"name\": \"ACTEN25\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 34]}, {\"metadata\": {\"name\": \"ACTEN75\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 35]}, {\"metadata\": {\"name\": \"ACTMT25\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 36]}, {\"metadata\": {\"name\": \"ACTMT75\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 37]}, {\"metadata\": {\"name\": \"ACTWR25\", \"semantic_types\": [\"https://metadata.datadrivendiscovery.org/types/CategoricalData\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 38]}, {\"metadata\": {\"name\": \"ACTWR75\", \"semantic_types\": 
[\"https://metadata.datadrivendiscovery.org/types/CategoricalData\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 39]}, {\"metadata\": {\"name\": \"ACTCMMID\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 40]}, {\"metadata\": {\"name\": \"ACTENMID\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 41]}, {\"metadata\": {\"name\": \"ACTMTMID\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 42]}, {\"metadata\": {\"name\": \"ACTWRMID\", \"semantic_types\": [\"https://metadata.datadrivendiscovery.org/types/CategoricalData\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 43]}, {\"metadata\": {\"name\": \"SAT_AVG\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 44]}, {\"metadata\": {\"name\": \"SAT_AVG_ALL\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 45]}, {\"metadata\": {\"name\": \"PCIP01\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 46]}, {\"metadata\": {\"name\": \"PCIP03\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 47]}, {\"metadata\": {\"name\": \"PCIP04\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 48]}, {\"metadata\": {\"name\": \"PCIP05\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 49]}, {\"metadata\": {\"name\": \"PCIP09\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 50]}, {\"metadata\": {\"name\": \"PCIP10\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 51]}, {\"metadata\": {\"name\": \"PCIP11\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 52]}, {\"metadata\": {\"name\": \"PCIP12\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 53]}, {\"metadata\": {\"name\": \"PCIP13\", \"semantic_types\": [\"http://schema.org/Float\", 
\"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 54]}, {\"metadata\": {\"name\": \"PCIP14\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 55]}, {\"metadata\": {\"name\": \"PCIP15\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 56]}, {\"metadata\": {\"name\": \"PCIP16\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 57]}, {\"metadata\": {\"name\": \"PCIP19\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 58]}, {\"metadata\": {\"name\": \"PCIP22\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 59]}, {\"metadata\": {\"name\": \"PCIP23\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 60]}, {\"metadata\": {\"name\": \"PCIP24\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 61]}, {\"metadata\": {\"name\": \"PCIP25\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 62]}, {\"metadata\": {\"name\": \"PCIP26\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 63]}, {\"metadata\": {\"name\": \"PCIP27\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 64]}, {\"metadata\": {\"name\": \"PCIP29\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 65]}, {\"metadata\": {\"name\": \"PCIP30\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 66]}, {\"metadata\": {\"name\": \"PCIP31\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 67]}, {\"metadata\": {\"name\": \"PCIP38\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 68]}, {\"metadata\": {\"name\": \"PCIP39\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 69]}, {\"metadata\": {\"name\": \"PCIP40\", 
\"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 70]}, {\"metadata\": {\"name\": \"PCIP41\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 71]}, {\"metadata\": {\"name\": \"PCIP42\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 72]}, {\"metadata\": {\"name\": \"PCIP43\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 73]}, {\"metadata\": {\"name\": \"PCIP44\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 74]}, {\"metadata\": {\"name\": \"PCIP45\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 75]}, {\"metadata\": {\"name\": \"PCIP46\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 76]}, {\"metadata\": {\"name\": \"PCIP47\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 77]}, {\"metadata\": {\"name\": \"PCIP48\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 78]}, {\"metadata\": {\"name\": \"PCIP49\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 79]}, {\"metadata\": {\"name\": \"PCIP50\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 80]}, {\"metadata\": {\"name\": \"PCIP51\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 81]}, {\"metadata\": {\"name\": \"PCIP52\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 82]}, {\"metadata\": {\"name\": \"PCIP54\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 83]}, {\"metadata\": {\"name\": \"DISTANCEONLY\", \"semantic_types\": [\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 84]}, {\"metadata\": {\"name\": \"UGDS\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": 
[\"__ALL_ELEMENTS__\", 85]}, {\"metadata\": {\"name\": \"UGDS_WHITE\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 86]}, {\"metadata\": {\"name\": \"UGDS_BLACK\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 87]}, {\"metadata\": {\"name\": \"UGDS_HISP\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 88]}, {\"metadata\": {\"name\": \"UGDS_ASIAN\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 89]}, {\"metadata\": {\"name\": \"UGDS_AIAN\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 90]}, {\"metadata\": {\"name\": \"UGDS_NHPI\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 91]}, {\"metadata\": {\"name\": \"UGDS_2MOR\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 92]}, {\"metadata\": {\"name\": \"UGDS_NRA\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 93]}, {\"metadata\": {\"name\": \"UGDS_UNKN\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 94]}, {\"metadata\": {\"name\": \"PPTUG_EF\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 95]}, {\"metadata\": {\"name\": \"CURROPER\", \"semantic_types\": [\"http://schema.org/Integer\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 96]}, {\"metadata\": {\"name\": \"NPT4_PUB\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 97]}, {\"metadata\": {\"name\": \"NPT4_PRIV\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 98]}, {\"metadata\": {\"name\": \"NPT41_PUB\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 99]}, {\"metadata\": {\"name\": \"NPT42_PUB\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 100]}, {\"metadata\": {\"name\": \"NPT43_PUB\", \"semantic_types\": [\"http://schema.org/Float\", 
\"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 101]}, {\"metadata\": {\"name\": \"NPT44_PUB\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 102]}, {\"metadata\": {\"name\": \"NPT45_PUB\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 103]}, {\"metadata\": {\"name\": \"NPT41_PRIV\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 104]}, {\"metadata\": {\"name\": \"NPT42_PRIV\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 105]}, {\"metadata\": {\"name\": \"NPT43_PRIV\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 106]}, {\"metadata\": {\"name\": \"NPT44_PRIV\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 107]}, {\"metadata\": {\"name\": \"NPT45_PRIV\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 108]}, {\"metadata\": {\"name\": \"PCTPELL\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 109]}, {\"metadata\": {\"name\": \"RET_FT4_POOLED_SUPP\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 110]}, {\"metadata\": {\"name\": \"RET_FTL4_POOLED_SUPP\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 111]}, {\"metadata\": {\"name\": \"RET_PT4_POOLED_SUPP\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 112]}, {\"metadata\": {\"name\": \"RET_PTL4_POOLED_SUPP\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 113]}, {\"metadata\": {\"name\": \"PCTFLOAN\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 114]}, {\"metadata\": {\"name\": \"UG25ABV\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 115]}, {\"metadata\": {\"name\": \"MD_EARN_WNE_P10\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], 
\"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 116]}, {\"metadata\": {\"name\": \"GT_25K_P6\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 117]}, {\"metadata\": {\"name\": \"GT_28K_P6\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 118]}, {\"metadata\": {\"name\": \"GRAD_DEBT_MDN_SUPP\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 119]}, {\"metadata\": {\"name\": \"GRAD_DEBT_MDN10YR_SUPP\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 120]}, {\"metadata\": {\"name\": \"RPY_3YR_RT_SUPP\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 121]}, {\"metadata\": {\"name\": \"C150_L4_POOLED_SUPP\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 122]}, {\"metadata\": {\"name\": \"C150_4_POOLED_SUPP\", \"semantic_types\": [\"http://schema.org/Float\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 123]}, {\"metadata\": {\"name\": \"UNITID_wikidata\", \"semantic_types\": [\"http://schema.org/Text\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 124]}, {\"metadata\": {\"name\": \"OPEID6_wikidata\", \"semantic_types\": [\"http://schema.org/Text\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 125]}, {\"metadata\": {\"name\": \"STABBR_wikidata\", \"semantic_types\": [\"http://schema.org/Text\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 126]}, {\"metadata\": {\"name\": \"CITY_wikidata\", \"semantic_types\": [\"http://schema.org/Text\", \"https://metadata.datadrivendiscovery.org/types/Attribute\"], \"structural_type\": \"str\"}, \"selector\": [\"__ALL_ELEMENTS__\", 127]}], \"score\": 0.9398390424662831, \"summary\": {\"Columns\": [\"[0] UNITID\", \"[1] OPEID\", \"[2] OPEID6\", \"[3] INSTNM\", \"[4] CITY\", \"[5] STABBR\", \"[6] INSTURL\", \"[7] NPCURL\", \"[8] HCM2\", \"[9] PREDDEG\", \"[10] HIGHDEG\", \"[11] CONTROL\", \"[12] LOCALE\", \"[13] HBCU\", \"[14] PBI\", \"[15] ANNHI\", \"[16] TRIBAL\", \"[17] AANAPII\", \"[18] HSI\", \"[19] NANTI\", \"[20] MENONLY\", \"[21] WOMENONLY\", \"[22] RELAFFIL\", \"[23] SATVR25\", \"[24] SATVR75\", \"[25] SATMT25\", \"[26] SATMT75\", \"[27] SATWR25\", \"[28] SATWR75\", \"[29] SATVRMID\", \"[30] SATMTMID\", \"[31] SATWRMID\", \"[32] ACTCM25\", \"[33] ACTCM75\", \"[34] ACTEN25\", \"[35] ACTEN75\", \"[36] ACTMT25\", \"[37] ACTMT75\", \"[38] ACTWR25\", \"[39] ACTWR75\", \"[40] ACTCMMID\", \"[41] ACTENMID\", \"[42] ACTMTMID\", \"[43] ACTWRMID\", \"[44] SAT_AVG\", \"[45] SAT_AVG_ALL\", \"[46] PCIP01\", \"[47] PCIP03\", 
\"[48] PCIP04\", \"[49] PCIP05\", \"[50] PCIP09\", \"[51] PCIP10\", \"[52] PCIP11\", \"[53] PCIP12\", \"[54] PCIP13\", \"[55] PCIP14\", \"[56] PCIP15\", \"[57] PCIP16\", \"[58] PCIP19\", \"[59] PCIP22\", \"[60] PCIP23\", \"[61] PCIP24\", \"[62] PCIP25\", \"[63] PCIP26\", \"[64] PCIP27\", \"[65] PCIP29\", \"[66] PCIP30\", \"[67] PCIP31\", \"[68] PCIP38\", \"[69] PCIP39\", \"[70] PCIP40\", \"[71] PCIP41\", \"[72] PCIP42\", \"[73] PCIP43\", \"[74] PCIP44\", \"[75] PCIP45\", \"[76] PCIP46\", \"[77] PCIP47\", \"[78] PCIP48\", \"[79] PCIP49\", \"[80] PCIP50\", \"[81] PCIP51\", \"[82] PCIP52\", \"[83] PCIP54\", \"[84] DISTANCEONLY\", \"[85] UGDS\", \"[86] UGDS_WHITE\", \"[87] UGDS_BLACK\", \"[88] UGDS_HISP\", \"[89] UGDS_ASIAN\", \"[90] UGDS_AIAN\", \"[91] UGDS_NHPI\", \"[92] UGDS_2MOR\", \"[93] UGDS_NRA\", \"[94] UGDS_UNKN\", \"[95] PPTUG_EF\", \"[96] CURROPER\", \"[97] NPT4_PUB\", \"[98] NPT4_PRIV\", \"[99] NPT41_PUB\", \"[100] NPT42_PUB\", \"[101] NPT43_PUB\", \"[102] NPT44_PUB\", \"[103] NPT45_PUB\", \"[104] NPT41_PRIV\", \"[105] NPT42_PRIV\", \"[106] NPT43_PRIV\", \"[107] NPT44_PRIV\", \"[108] NPT45_PRIV\", \"[109] PCTPELL\", \"[110] RET_FT4_POOLED_SUPP\", \"[111] RET_FTL4_POOLED_SUPP\", \"[112] RET_PT4_POOLED_SUPP\", \"[113] RET_PTL4_POOLED_SUPP\", \"[114] PCTFLOAN\", \"[115] UG25ABV\", \"[116] MD_EARN_WNE_P10\", \"[117] GT_25K_P6\", \"[118] GT_28K_P6\", \"[119] GRAD_DEBT_MDN_SUPP\", \"[120] GRAD_DEBT_MDN10YR_SUPP\", \"[121] RPY_3YR_RT_SUPP\", \"[122] C150_L4_POOLED_SUPP\", \"[123] C150_4_POOLED_SUPP\", \"[124] UNITID_wikidata\", \"[125] OPEID6_wikidata\", \"[126] STABBR_wikidata\", \"[127] CITY_wikidata\"], \"Datamart ID\": \"D4cb70062-77ed-4097-a486-0b43ffe81463\", \"Recommend Join Columns\": \"INSTNM\", \"Score\": \"0.9398390424662831\", \"URL\": \"http://dsbox02.isi.edu:9000/upload/local_datasets/Most-Recent-Cohorts-Scorecard-Elements.csv\", \"title\": \"most recent cohorts scorecard elements csv\"}, \"supplied_id\": \"DA_college_debt_dataset_TRAIN\", \"supplied_resource_id\": \"learningData\"}" - } - } - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "4b42ce1e-9b98-4a25-b68e-fad13311eb65", - "version": "0.3.0", - "python_path": "d3m.primitives.data_transformation.dataset_to_dataframe.Common", - "name": "Extract a DataFrame from a Dataset", - "digest": "57475517f8d20c260757a13497239c3ddfb3c0949ab9769e5c177c18b919eaa1" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.3.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ] - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "3002bc5b-fa47-4a3d-882e-a8b5f3d756aa", - "version": "0.1.0", - "python_path": "d3m.primitives.data_transformation.remove_semantic_types.Common", - "name": "Remove semantic types from columns", - "digest": "a7a99c19c430ad238787bb17f33bb5ad6dd62f350190284dae86798f880281c0" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.4.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "hyperparams": { - "semantic_types": { - "type": "VALUE", - "data": [ - "http://wikidata.org/qnode" - ] - } - } - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1", - "version": "0.3.0", - "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common", - "name": "Extracts columns by semantic type", - "digest": "75a68013cd3c12e77ba31e392298d2a62766ae00d556fdaf30401f7ba4a29b8c" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.5.produce" - } - }, - "outputs": [ - { 
- "id": "produce" - } - ], - "hyperparams": { - "semantic_types": { - "type": "VALUE", - "data": [ - "https://metadata.datadrivendiscovery.org/types/PrimaryKey", - "https://metadata.datadrivendiscovery.org/types/Attribute" - ] - } - } - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "b2612849-39e4-33ce-bfda-24f3e2cb1e93", - "version": "1.5.3", - "python_path": "d3m.primitives.schema_discovery.profiler.DSBOX", - "name": "DSBox Profiler", - "digest": "d584c3e2af2f60947f9703fd8aa22ea04ccf9fe20266a3f7ac87da939838fe5f" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.6.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ] - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "dsbox-cleaning-featurizer", - "version": "1.5.3", - "python_path": "d3m.primitives.data_cleaning.cleaning_featurizer.DSBOX", - "name": "DSBox Cleaning Featurizer", - "digest": "3e0646c87ba9d9745ff0ced1ef381da434a29af1654bd3cdc2db46a7f1a87f20" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.7.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ] - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "18f0bb42-6350-3753-8f2d-d1c3da70f279", - "version": "1.5.3", - "python_path": "d3m.primitives.data_preprocessing.encoder.DSBOX", - "name": "ISI DSBox Data Encoder", - "digest": "026f3fb4af7c426034e492829a2fb6968bb6961fee868f6d3be4fd5c0aae72f7" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.8.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ] - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "0c64ffd6-cb9e-49f0-b7cb-abd70a5a8261", - "version": "1.0.0", - "python_path": "d3m.primitives.feature_construction.corex_text.DSBOX", - "name": "CorexText", - "digest": "7d942ed753a5d1d4089a37aa446c25cf80a14e6fb0feb2a6a4fc0218d5f88292" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.9.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ] - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "7ddf2fd8-2f7f-4e53-96a7-0d9f5aeecf93", - "version": "1.5.3", - "python_path": "d3m.primitives.data_transformation.to_numeric.DSBOX", - "name": "ISI DSBox To Numeric DataFrame", - "digest": "0c06f13376139c95f9c7ee2c4ea0e1b74242092a9c0a5359444b584b7b26b4b6" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.10.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ] - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "7894b699-61e9-3a50-ac9f-9bc510466667", - "version": "1.5.3", - "python_path": "d3m.primitives.data_preprocessing.mean_imputation.DSBOX", - "name": "DSBox Mean Imputer", - "digest": "c06061074a29dffda6f59d779bf6658fd69f5d101a4e48569cd6ad35775da9f0" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.11.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ] - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "dsbox-featurizer-do-nothing", - "version": "1.5.3", - "python_path": "d3m.primitives.data_preprocessing.do_nothing.DSBOX", - "name": "DSBox do-nothing primitive", - "digest": "b540e87d22c38511e88693cce3dcdba7085ede3119d2b18c1172f734df16ce43" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.12.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ] - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "dsbox-featurizer-do-nothing", - "version": "1.5.3", - "python_path": "d3m.primitives.data_preprocessing.do_nothing.DSBOX", - "name": "DSBox do-nothing 
primitive", - "digest": "b540e87d22c38511e88693cce3dcdba7085ede3119d2b18c1172f734df16ce43" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.13.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ] - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1", - "version": "0.3.0", - "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common", - "name": "Extracts columns by semantic type", - "digest": "75a68013cd3c12e77ba31e392298d2a62766ae00d556fdaf30401f7ba4a29b8c" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.4.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "hyperparams": { - "semantic_types": { - "type": "VALUE", - "data": [ - "https://metadata.datadrivendiscovery.org/types/TrueTarget" - ] - } - } - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "7ddf2fd8-2f7f-4e53-96a7-0d9f5aeecf93", - "version": "1.5.3", - "python_path": "d3m.primitives.data_transformation.to_numeric.DSBOX", - "name": "ISI DSBox To Numeric DataFrame", - "digest": "0c06f13376139c95f9c7ee2c4ea0e1b74242092a9c0a5359444b584b7b26b4b6" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.15.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "hyperparams": { - "drop_non_numeric_columns": { - "type": "VALUE", - "data": false - } - } - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "35321059-2a1a-31fd-9509-5494efc751c7", - "version": "2019.6.7", - "python_path": "d3m.primitives.regression.extra_trees.SKlearn", - "name": "sklearn.ensemble.forest.ExtraTreesRegressor", - "digest": "0a8153e2821cacf807429c02b1b210ed6c700e8342b7af988b93245514b6f345" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.14.produce" - }, - "outputs": { - "type": "CONTAINER", - "data": "steps.16.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "hyperparams": { - "bootstrap": { - "type": "VALUE", - "data": "bootstrap" - }, - "max_depth": { - "type": "VALUE", - "data": { - "case": "none", - "value": null - } - }, - "min_samples_leaf": { - "type": "VALUE", - "data": { - "case": "absolute", - "value": 1 - } - }, - "min_samples_split": { - "type": "VALUE", - "data": { - "case": "int", - "value": 10 - } - }, - "max_features": { - "type": "VALUE", - "data": { - "case": "calculated", - "value": "auto" - } - }, - "n_estimators": { - "type": "VALUE", - "data": 100 - }, - "add_index_columns": { - "type": "VALUE", - "data": true - }, - "use_semantic_types": { - "type": "VALUE", - "data": true - } - } - } - ], - "source": { - "name": "ISI", - "contact": "mailto:kyao@isi.edu" - }, - "name": "default_regression_template:140004175511624", - "description": "", - "digest": "e1a65c6510329dcee1df7ebee899a5b554578b054e8e0cb1434af88cce4b8d45" -} diff --git a/common-primitives/pipelines/data_augmentation.datamart_augmentation.Common/4ff2f21d-1bba-4c44-bb96-e05728bcf6ed.json b/common-primitives/pipelines/data_augmentation.datamart_augmentation.Common/4ff2f21d-1bba-4c44-bb96-e05728bcf6ed.json deleted file mode 100644 index 5dad103..0000000 --- a/common-primitives/pipelines/data_augmentation.datamart_augmentation.Common/4ff2f21d-1bba-4c44-bb96-e05728bcf6ed.json +++ /dev/null @@ -1,342 +0,0 @@ -{ - "id": "4ff2f21d-1bba-4c44-bb96-e05728bcf6ed", - "name": "classification_template(imputer=d3m.primitives.data_cleaning.imputer.SKlearn, classifier=d3m.primitives.regression.random_forest.SKlearn)", - "description": "To be used with NYU 
datamart.", - "schema": "https://metadata.datadrivendiscovery.org/schemas/v0/pipeline.json", - "created": "2019-06-06T21:30:30Z", - "context": "TESTING", - "inputs": [ - { - "name": "input dataset" - } - ], - "outputs": [ - { - "data": "steps.12.produce", - "name": "predictions" - } - ], - "steps": [ - { - "type": "PRIMITIVE", - "primitive": { - "id": "fe0f1ac8-1d39-463a-b344-7bd498a31b91", - "version": "0.1", - "name": "Perform dataset augmentation using Datamart", - "python_path": "d3m.primitives.data_augmentation.datamart_augmentation.Common" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "inputs.0" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "hyperparams": { - "system_identifier": { - "type": "VALUE", - "data": "NYU" - }, - "search_result": { - "type": "VALUE", - "data": "{\"augmentation\": {\"left_columns\": [[1]], \"left_columns_names\": [\"tpep_pickup_datetime\"], \"right_columns\": [[0]], \"type\": \"join\"}, \"id\": \"datamart.url.a3943fd7892d5d219012f889327c6661\", \"metadata\": {\"columns\": [{\"coverage\": [{\"range\": {\"gte\": 1451610000.0, \"lte\": 1540252800.0}}], \"mean\": 1495931400.0, \"name\": \"DATE\", \"semantic_types\": [\"http://schema.org/DateTime\"], \"stddev\": 25590011.431395352, \"structural_type\": \"http://schema.org/Text\"}, {\"name\": \"HOURLYSKYCONDITIONS\", \"semantic_types\": [], \"structural_type\": \"http://schema.org/Text\"}, {\"coverage\": [{\"range\": {\"gte\": -17.2, \"lte\": 37.8}}], \"mean\": 14.666224009096823, \"name\": \"HOURLYDRYBULBTEMPC\", \"semantic_types\": [], \"stddev\": 9.973788193915643, \"structural_type\": \"http://schema.org/Float\"}, {\"coverage\": [{\"range\": {\"gte\": 11.0, \"lte\": 100.0}}], \"mean\": 60.70849577647823, \"name\": \"HOURLYRelativeHumidity\", \"semantic_types\": [], \"stddev\": 18.42048051096981, \"structural_type\": \"http://schema.org/Float\"}, {\"coverage\": [{\"range\": {\"gte\": 0.0, \"lte\": 41.0}}], \"mean\": 10.68859649122807, \"name\": \"HOURLYWindSpeed\", \"semantic_types\": [], \"stddev\": 5.539675475162907, \"structural_type\": \"http://schema.org/Float\"}, {\"name\": \"HOURLYWindDirection\", \"semantic_types\": [], \"structural_type\": \"http://schema.org/Text\"}, {\"coverage\": [{\"range\": {\"gte\": 28.89, \"lte\": 30.81}}], \"mean\": 29.90760315139694, \"name\": \"HOURLYStationPressure\", \"semantic_types\": [\"https://metadata.datadrivendiscovery.org/types/PhoneNumber\"], \"stddev\": 0.24584097919742368, \"structural_type\": \"http://schema.org/Float\"}], \"date\": \"2019-01-22T01:54:58.281183Z\", \"description\": \"This data contains weather information for NY city around LaGuardia Airport from 2016 to 2018; weath...\", \"materialize\": {\"direct_url\": \"https://drive.google.com/uc?export=download&id=1jRwzZwEGMICE3n6-nwmVxMD2c0QCHad4\", \"identifier\": \"datamart.url\"}, \"name\": \"Newyork Weather Data around Airport 2016-18\", \"nb_rows\": 24624, \"size\": 1523693}, \"score\": 1.0, \"supplied_id\": \"DA_ny_taxi_demand_dataset_TRAIN\", \"supplied_resource_id\": \"learningData\"}" - } - } - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "f31f8c1f-d1c5-43e5-a4b2-2ae4a761ef2e", - "version": "0.2.0", - "name": "Denormalize datasets", - "python_path": "d3m.primitives.data_transformation.denormalize.Common" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.0.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ] - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "4b42ce1e-9b98-4a25-b68e-fad13311eb65", - "version": 
"0.3.0", - "name": "Extract a DataFrame from a Dataset", - "python_path": "d3m.primitives.data_transformation.dataset_to_dataframe.Common" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.1.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ] - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "d510cb7a-1782-4f51-b44c-58f0236e47c7", - "version": "0.6.0", - "name": "Parses strings into their types", - "python_path": "d3m.primitives.data_transformation.column_parser.Common" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.2.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ] - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1", - "version": "0.3.0", - "name": "Extracts columns by semantic type", - "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.3.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "hyperparams": { - "semantic_types": { - "type": "VALUE", - "data": [ - "https://metadata.datadrivendiscovery.org/types/Attribute" - ] - } - } - }, - { - "type": "PRIMITIVE", - "primitive": { - "name": "sklearn.impute.SimpleImputer", - "python_path": "d3m.primitives.data_cleaning.imputer.SKlearn", - "version": "2019.11.13", - "id": "d016df89-de62-3c53-87ed-c06bb6a23cde" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.4.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "hyperparams": { - "strategy": { - "type": "VALUE", - "data": "most_frequent" - } - } - }, - { - "type": "PRIMITIVE", - "primitive": { - "name": "sklearn.preprocessing.data.OneHotEncoder", - "python_path": "d3m.primitives.data_transformation.one_hot_encoder.SKlearn", - "version": "2019.11.13", - "id": "c977e879-1bf5-3829-b5b0-39b00233aff5" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.5.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "hyperparams": { - "handle_unknown": { - "type": "VALUE", - "data": "ignore" - } - } - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "eb5fe752-f22a-4090-948b-aafcef203bf5", - "version": "0.2.0", - "name": "Casts DataFrame", - "python_path": "d3m.primitives.data_transformation.cast_to_type.Common" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.6.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "hyperparams": { - "type_to_cast": { - "type": "VALUE", - "data": "float" - } - } - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1", - "version": "0.3.0", - "name": "Extracts columns by semantic type", - "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.3.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "hyperparams": { - "semantic_types": { - "type": "VALUE", - "data": [ - "https://metadata.datadrivendiscovery.org/types/TrueTarget" - ] - } - } - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "eb5fe752-f22a-4090-948b-aafcef203bf5", - "version": "0.2.0", - "name": "Casts DataFrame", - "python_path": "d3m.primitives.data_transformation.cast_to_type.Common" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.8.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ] - }, - { - "type": "PRIMITIVE", - 
"primitive": { - "name": "sklearn.ensemble.forest.RandomForestRegressor", - "python_path": "d3m.primitives.regression.random_forest.SKlearn", - "version": "2019.11.13", - "id": "f0fd7a62-09b5-3abc-93bb-f5f999f7cc80" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.7.produce" - }, - "outputs": { - "type": "CONTAINER", - "data": "steps.9.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ] - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1", - "version": "0.3.0", - "name": "Extracts columns by semantic type", - "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.3.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "hyperparams": { - "semantic_types": { - "type": "VALUE", - "data": [ - "https://metadata.datadrivendiscovery.org/types/Target", - "https://metadata.datadrivendiscovery.org/types/PrimaryKey" - ] - } - } - }, - { - "type": "PRIMITIVE", - "primitive": { - "id": "8d38b340-f83f-4877-baaa-162f8e551736", - "version": "0.3.0", - "name": "Construct pipeline predictions output", - "python_path": "d3m.primitives.data_transformation.construct_predictions.Common" - }, - "arguments": { - "inputs": { - "type": "CONTAINER", - "data": "steps.10.produce" - }, - "reference": { - "type": "CONTAINER", - "data": "steps.11.produce" - } - }, - "outputs": [ - { - "id": "produce" - } - ] - } - ] -} diff --git a/common-primitives/pipelines/data_preprocessing.dataset_sample.Common/387d432a-9893-4558-b190-1c5e9e399dbf.yaml b/common-primitives/pipelines/data_preprocessing.dataset_sample.Common/387d432a-9893-4558-b190-1c5e9e399dbf.yaml deleted file mode 100644 index d8ece59..0000000 --- a/common-primitives/pipelines/data_preprocessing.dataset_sample.Common/387d432a-9893-4558-b190-1c5e9e399dbf.yaml +++ /dev/null @@ -1,123 +0,0 @@ -id: 387d432a-9893-4558-b190-1c5e9e399dbf -schema: https://metadata.datadrivendiscovery.org/schemas/v0/pipeline.json -source: - name: Jeffrey Gleason -created: "2019-06-05T2:48:52.806069Z" -context: TESTING -name: Dataset sample test pipeline -description: | - A simple pipeline which runs Random Forest classifier on tabular data after sampling the dataset (50% of rows) -inputs: - - name: input dataset -outputs: - - name: predictions - data: steps.6.produce -steps: - # Step 0. - - type: PRIMITIVE - primitive: - id: 268315c1-7549-4aee-a4cc-28921cba74c0 - version: 0.1.0 - python_path: d3m.primitives.data_preprocessing.dataset_sample.Common - name: Dataset sampling primitive - arguments: - inputs: - type: CONTAINER - data: inputs.0 - outputs: - - id: produce - # Step 1. - - type: PRIMITIVE - primitive: - id: f31f8c1f-d1c5-43e5-a4b2-2ae4a761ef2e - version: 0.2.0 - python_path: d3m.primitives.data_transformation.denormalize.Common - name: Denormalize datasets - arguments: - inputs: - type: CONTAINER - data: steps.0.produce - outputs: - - id: produce - # Step 2. - - type: PRIMITIVE - primitive: - id: 4b42ce1e-9b98-4a25-b68e-fad13311eb65 - version: 0.3.0 - python_path: d3m.primitives.data_transformation.dataset_to_dataframe.Common - name: Extract a DataFrame from a Dataset - arguments: - inputs: - type: CONTAINER - data: steps.1.produce - outputs: - - id: produce - # Step 3. 
- - type: PRIMITIVE - primitive: - id: d510cb7a-1782-4f51-b44c-58f0236e47c7 - version: 0.6.0 - python_path: d3m.primitives.data_transformation.column_parser.Common - name: Parses strings into their types - arguments: - inputs: - type: CONTAINER - data: steps.2.produce - outputs: - - id: produce - # Step 4. - - type: PRIMITIVE - primitive: - id: d016df89-de62-3c53-87ed-c06bb6a23cde - version: 2019.6.7 - python_path: d3m.primitives.data_cleaning.imputer.SKlearn - name: sklearn.impute.SimpleImputer - arguments: - inputs: - type: CONTAINER - data: steps.3.produce - outputs: - - id: produce - hyperparams: - use_semantic_types: - type: VALUE - data: true - return_result: - type: VALUE - data: replace - # Step 5. - - type: PRIMITIVE - primitive: - id: 37c2b19d-bdab-4a30-ba08-6be49edcc6af - version: 0.4.0 - python_path: d3m.primitives.classification.random_forest.Common - name: Random forest classifier - arguments: - inputs: - type: CONTAINER - data: steps.4.produce - outputs: - type: CONTAINER - data: steps.4.produce - outputs: - - id: produce - hyperparams: - return_result: - type: VALUE - data: replace - # Step 6. - - type: PRIMITIVE - primitive: - id: 8d38b340-f83f-4877-baaa-162f8e551736 - version: 0.3.0 - python_path: d3m.primitives.data_transformation.construct_predictions.Common - name: Construct pipeline predictions output - arguments: - inputs: - type: CONTAINER - data: steps.5.produce - reference: - type: CONTAINER - data: steps.3.produce - outputs: - - id: produce diff --git a/common-primitives/pipelines/data_preprocessing.one_hot_encoder.MakerCommon/2b307634-f01e-412e-8d95-7e54afd4731f.json b/common-primitives/pipelines/data_preprocessing.one_hot_encoder.MakerCommon/2b307634-f01e-412e-8d95-7e54afd4731f.json deleted file mode 100644 index 5606e66..0000000 --- a/common-primitives/pipelines/data_preprocessing.one_hot_encoder.MakerCommon/2b307634-f01e-412e-8d95-7e54afd4731f.json +++ /dev/null @@ -1,300 +0,0 @@ -{ - "context": "TESTING", - "created": "2019-02-12T02:10:00.929519Z", - "id": "2b307634-f01e-412e-8d95-7e54afd4731f", - "inputs": [ - { - "name": "inputs" - } - ], - "outputs": [ - { - "data": "steps.9.produce", - "name": "output predictions" - } - ], - "schema": "https://metadata.datadrivendiscovery.org/schemas/v0/pipeline.json", - "steps": [ - { - "arguments": { - "inputs": { - "data": "inputs.0", - "type": "CONTAINER" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "4b42ce1e-9b98-4a25-b68e-fad13311eb65", - "name": "Extract a DataFrame from a Dataset", - "python_path": "d3m.primitives.data_transformation.dataset_to_dataframe.Common", - "version": "0.3.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.0.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "parse_semantic_types": { - "data": [ - "http://schema.org/Boolean", - "http://schema.org/Integer", - "http://schema.org/Float", - "https://metadata.datadrivendiscovery.org/types/FloatVector", - "http://schema.org/DateTime" - ], - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "d510cb7a-1782-4f51-b44c-58f0236e47c7", - "name": "Parses strings into their types", - "python_path": "d3m.primitives.data_transformation.column_parser.Common", - "version": "0.6.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.1.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "semantic_types": { - "data": [ - "https://metadata.datadrivendiscovery.org/types/CategoricalData" - ], - 
"type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1", - "name": "Extracts columns by semantic type", - "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common", - "version": "0.3.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.1.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "exclude_columns": { - "data": [ - 0 - ], - "type": "VALUE" - }, - "semantic_types": { - "data": [ - "http://schema.org/Integer", - "http://schema.org/Float" - ], - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1", - "name": "Extracts columns by semantic type", - "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common", - "version": "0.3.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.0.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "semantic_types": { - "data": [ - "https://metadata.datadrivendiscovery.org/types/TrueTarget" - ], - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1", - "name": "Extracts columns by semantic type", - "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common", - "version": "0.3.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.3.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "return_result": { - "data": "replace", - "type": "VALUE" - }, - "use_semantic_types": { - "data": true, - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "d016df89-de62-3c53-87ed-c06bb6a23cde", - "name": "sklearn.impute.SimpleImputer", - "python_path": "d3m.primitives.data_cleaning.imputer.SKlearn", - "version": "2019.6.7" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.2.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "return_result": { - "data": "replace", - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "eaec420d-46eb-4ddf-a2cd-b8097345ff3e", - "name": "One-hot maker", - "python_path": "d3m.primitives.data_preprocessing.one_hot_encoder.MakerCommon", - "version": "0.2.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "left": { - "data": "steps.6.produce", - "type": "CONTAINER" - }, - "right": { - "data": "steps.5.produce", - "type": "CONTAINER" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "aff6a77a-faa0-41c5-9595-de2e7f7c4760", - "name": "Concatenate two dataframes", - "python_path": "d3m.primitives.data_transformation.horizontal_concat.DataFrameCommon", - "version": "0.2.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.7.produce", - "type": "CONTAINER" - }, - "outputs": { - "data": "steps.4.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "return_result": { - "data": "replace", - "type": "VALUE" - }, - "use_semantic_types": { - "data": true, - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "1dd82833-5692-39cb-84fb-2455683075f3", - "name": "sklearn.ensemble.forest.RandomForestClassifier", - "python_path": "d3m.primitives.classification.random_forest.SKlearn", - "version": "2019.6.7" - }, - "type": "PRIMITIVE" - }, - { - 
"arguments": { - "inputs": { - "data": "steps.8.produce", - "type": "CONTAINER" - }, - "reference": { - "data": "steps.1.produce", - "type": "CONTAINER" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "8d38b340-f83f-4877-baaa-162f8e551736", - "name": "Construct pipeline predictions output", - "python_path": "d3m.primitives.data_transformation.construct_predictions.Common", - "version": "0.3.0" - }, - "type": "PRIMITIVE" - } - ] -} diff --git a/common-primitives/pipelines/data_preprocessing.one_hot_encoder.PandasCommon/b523335c-0c47-4d02-a582-f69609cde1e8.json b/common-primitives/pipelines/data_preprocessing.one_hot_encoder.PandasCommon/b523335c-0c47-4d02-a582-f69609cde1e8.json deleted file mode 120000 index 51266fd..0000000 --- a/common-primitives/pipelines/data_preprocessing.one_hot_encoder.PandasCommon/b523335c-0c47-4d02-a582-f69609cde1e8.json +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.extract_columns_by_structural_types.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json \ No newline at end of file diff --git a/common-primitives/pipelines/data_transformation.column_parser.DataFrameCommon/4ec215d1-6484-4502-a6dd-f659943ccb94.json b/common-primitives/pipelines/data_transformation.column_parser.DataFrameCommon/4ec215d1-6484-4502-a6dd-f659943ccb94.json deleted file mode 120000 index 0deae2e..0000000 --- a/common-primitives/pipelines/data_transformation.column_parser.DataFrameCommon/4ec215d1-6484-4502-a6dd-f659943ccb94.json +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.extract_columns.Common/4ec215d1-6484-4502-a6dd-f659943ccb94.json \ No newline at end of file diff --git a/common-primitives/pipelines/data_transformation.column_parser.DataFrameCommon/a8c40699-c48d-4f12-aa18-639c5fb6baae.json b/common-primitives/pipelines/data_transformation.column_parser.DataFrameCommon/a8c40699-c48d-4f12-aa18-639c5fb6baae.json deleted file mode 120000 index b1225d9..0000000 --- a/common-primitives/pipelines/data_transformation.column_parser.DataFrameCommon/a8c40699-c48d-4f12-aa18-639c5fb6baae.json +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.grouping_field_compose.Common/a8c40699-c48d-4f12-aa18-639c5fb6baae.json \ No newline at end of file diff --git a/common-primitives/pipelines/data_transformation.column_parser.DataFrameCommon/b523335c-0c47-4d02-a582-f69609cde1e8.json b/common-primitives/pipelines/data_transformation.column_parser.DataFrameCommon/b523335c-0c47-4d02-a582-f69609cde1e8.json deleted file mode 120000 index 51266fd..0000000 --- a/common-primitives/pipelines/data_transformation.column_parser.DataFrameCommon/b523335c-0c47-4d02-a582-f69609cde1e8.json +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.extract_columns_by_structural_types.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json \ No newline at end of file diff --git a/common-primitives/pipelines/data_transformation.column_parser.DataFrameCommon/d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde.json b/common-primitives/pipelines/data_transformation.column_parser.DataFrameCommon/d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde.json deleted file mode 120000 index 080c8da..0000000 --- a/common-primitives/pipelines/data_transformation.column_parser.DataFrameCommon/d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde.json +++ /dev/null @@ -1 +0,0 @@ -../classification.light_gbm.DataFrameCommon/d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde.json \ No newline at end of file diff --git a/common-primitives/pipelines/data_transformation.construct_predictions.DataFrameCommon/4ec215d1-6484-4502-a6dd-f659943ccb94.json 
b/common-primitives/pipelines/data_transformation.construct_predictions.DataFrameCommon/4ec215d1-6484-4502-a6dd-f659943ccb94.json deleted file mode 120000 index 0deae2e..0000000 --- a/common-primitives/pipelines/data_transformation.construct_predictions.DataFrameCommon/4ec215d1-6484-4502-a6dd-f659943ccb94.json +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.extract_columns.Common/4ec215d1-6484-4502-a6dd-f659943ccb94.json \ No newline at end of file diff --git a/common-primitives/pipelines/data_transformation.construct_predictions.DataFrameCommon/b523335c-0c47-4d02-a582-f69609cde1e8.json b/common-primitives/pipelines/data_transformation.construct_predictions.DataFrameCommon/b523335c-0c47-4d02-a582-f69609cde1e8.json deleted file mode 120000 index 51266fd..0000000 --- a/common-primitives/pipelines/data_transformation.construct_predictions.DataFrameCommon/b523335c-0c47-4d02-a582-f69609cde1e8.json +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.extract_columns_by_structural_types.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json \ No newline at end of file diff --git a/common-primitives/pipelines/data_transformation.construct_predictions.DataFrameCommon/d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde.json b/common-primitives/pipelines/data_transformation.construct_predictions.DataFrameCommon/d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde.json deleted file mode 120000 index 080c8da..0000000 --- a/common-primitives/pipelines/data_transformation.construct_predictions.DataFrameCommon/d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde.json +++ /dev/null @@ -1 +0,0 @@ -../classification.light_gbm.DataFrameCommon/d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde.json \ No newline at end of file diff --git a/common-primitives/pipelines/data_transformation.dataset_to_dataframe.Common/4ec215d1-6484-4502-a6dd-f659943ccb94.json b/common-primitives/pipelines/data_transformation.dataset_to_dataframe.Common/4ec215d1-6484-4502-a6dd-f659943ccb94.json deleted file mode 120000 index 0deae2e..0000000 --- a/common-primitives/pipelines/data_transformation.dataset_to_dataframe.Common/4ec215d1-6484-4502-a6dd-f659943ccb94.json +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.extract_columns.Common/4ec215d1-6484-4502-a6dd-f659943ccb94.json \ No newline at end of file diff --git a/common-primitives/pipelines/data_transformation.dataset_to_dataframe.Common/a8c40699-c48d-4f12-aa18-639c5fb6baae.json b/common-primitives/pipelines/data_transformation.dataset_to_dataframe.Common/a8c40699-c48d-4f12-aa18-639c5fb6baae.json deleted file mode 120000 index b1225d9..0000000 --- a/common-primitives/pipelines/data_transformation.dataset_to_dataframe.Common/a8c40699-c48d-4f12-aa18-639c5fb6baae.json +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.grouping_field_compose.Common/a8c40699-c48d-4f12-aa18-639c5fb6baae.json \ No newline at end of file diff --git a/common-primitives/pipelines/data_transformation.dataset_to_dataframe.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json b/common-primitives/pipelines/data_transformation.dataset_to_dataframe.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json deleted file mode 120000 index 51266fd..0000000 --- a/common-primitives/pipelines/data_transformation.dataset_to_dataframe.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.extract_columns_by_structural_types.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json \ No newline at end of file diff --git a/common-primitives/pipelines/data_transformation.dataset_to_dataframe.Common/d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde.json 
b/common-primitives/pipelines/data_transformation.dataset_to_dataframe.Common/d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde.json deleted file mode 120000 index 080c8da..0000000 --- a/common-primitives/pipelines/data_transformation.dataset_to_dataframe.Common/d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde.json +++ /dev/null @@ -1 +0,0 @@ -../classification.light_gbm.DataFrameCommon/d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde.json \ No newline at end of file diff --git a/common-primitives/pipelines/data_transformation.extract_columns.Common/4ec215d1-6484-4502-a6dd-f659943ccb94.json b/common-primitives/pipelines/data_transformation.extract_columns.Common/4ec215d1-6484-4502-a6dd-f659943ccb94.json deleted file mode 100644 index 1217fd3..0000000 --- a/common-primitives/pipelines/data_transformation.extract_columns.Common/4ec215d1-6484-4502-a6dd-f659943ccb94.json +++ /dev/null @@ -1 +0,0 @@ -{"id": "4ec215d1-6484-4502-a6dd-f659943ccb94", "schema": "https://metadata.datadrivendiscovery.org/schemas/v0/pipeline.json", "created": "2020-01-15T17:49:59.327063Z", "inputs": [{"name": "inputs"}], "outputs": [{"data": "steps.7.produce", "name": "output predictions"}], "steps": [{"type": "PRIMITIVE", "primitive": {"id": "4b42ce1e-9b98-4a25-b68e-fad13311eb65", "version": "0.3.0", "python_path": "d3m.primitives.data_transformation.dataset_to_dataframe.Common", "name": "Extract a DataFrame from a Dataset", "digest": "a1a0109be87a6ae578fd20e9d46c70c806059076c041b80b6314e7e41cf62d82"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "inputs.0"}}, "outputs": [{"id": "produce"}]}, {"type": "PRIMITIVE", "primitive": {"id": "e193afa1-b45e-4d29-918f-5bb1fa3b88a7", "version": "0.2.0", "python_path": "d3m.primitives.schema_discovery.profiler.Common", "name": "Determine missing semantic types for columns automatically", "digest": "a3d51cbc0bf18168114c1c8f12c497d691dbe30b71667f355f30c13a9a08ba32"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.0.produce"}}, "outputs": [{"id": "produce"}]}, {"type": "PRIMITIVE", "primitive": {"id": "d510cb7a-1782-4f51-b44c-58f0236e47c7", "version": "0.6.0", "python_path": "d3m.primitives.data_transformation.column_parser.Common", "name": "Parses strings into their types", "digest": "b020e14e3d4f1e4266aa8a0680d83afcf2862300549c6f6c903742d7d171f879"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.1.produce"}}, "outputs": [{"id": "produce"}]}, {"type": "PRIMITIVE", "primitive": {"id": "81d7e261-e25b-4721-b091-a31cd46e99ae", "version": "0.1.0", "python_path": "d3m.primitives.data_transformation.extract_columns.Common", "name": "Extracts columns", "digest": "7b9ba98e3b7b9d1d8e17547249c7a25cd8d58ec60d957217f772753e37526145"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.2.produce"}}, "outputs": [{"id": "produce"}], "hyperparams": {"columns": {"type": "VALUE", "data": [25]}}}, {"type": "PRIMITIVE", "primitive": {"id": "81d7e261-e25b-4721-b091-a31cd46e99ae", "version": "0.1.0", "python_path": "d3m.primitives.data_transformation.extract_columns.Common", "name": "Extracts columns", "digest": "7b9ba98e3b7b9d1d8e17547249c7a25cd8d58ec60d957217f772753e37526145"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.2.produce"}}, "outputs": [{"id": "produce"}], "hyperparams": {"columns": {"type": "VALUE", "data": [6]}}}, {"type": "PRIMITIVE", "primitive": {"id": "09f252eb-215d-4e0b-9a60-fcd967f5e708", "version": "0.2.0", "python_path": "d3m.primitives.data_transformation.encoder.DistilTextEncoder", "name": "Text encoder", "digest": 
"e468d66d1eda057a61b2c79ecf5288f137778f47dac9eabdc60707a4941532a3"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.3.produce"}, "outputs": {"type": "CONTAINER", "data": "steps.4.produce"}}, "outputs": [{"id": "produce"}], "hyperparams": {"encoder_type": {"type": "VALUE", "data": "tfidf"}}}, {"type": "PRIMITIVE", "primitive": {"id": "e0ad06ce-b484-46b0-a478-c567e1ea7e02", "version": "0.2.0", "python_path": "d3m.primitives.learner.random_forest.DistilEnsembleForest", "name": "EnsembleForest", "digest": "4ba7a354b15ea626bf96aa771a2a3cba034ad5d0a8ccdbbf68bce2d828db1b4d"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.5.produce"}, "outputs": {"type": "CONTAINER", "data": "steps.4.produce"}}, "outputs": [{"id": "produce"}]}, {"type": "PRIMITIVE", "primitive": {"id": "8d38b340-f83f-4877-baaa-162f8e551736", "version": "0.3.0", "python_path": "d3m.primitives.data_transformation.construct_predictions.Common", "name": "Construct pipeline predictions output", "digest": "674a644333a3a481769591341591461b06de566fef7439010284739194e18af8"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.6.produce"}, "reference": {"type": "CONTAINER", "data": "steps.0.produce"}}, "outputs": [{"id": "produce"}]}], "digest": "a26edc0cc9bcf9121189186d621ff1b4cebb2afc76b6ef171d7d8194e55cf475"} \ No newline at end of file diff --git a/common-primitives/pipelines/data_transformation.extract_columns.Common/pipeline.py b/common-primitives/pipelines/data_transformation.extract_columns.Common/pipeline.py deleted file mode 100644 index e307251..0000000 --- a/common-primitives/pipelines/data_transformation.extract_columns.Common/pipeline.py +++ /dev/null @@ -1,71 +0,0 @@ -from d3m import index -from d3m.metadata.base import ArgumentType, Context -from d3m.metadata.pipeline import Pipeline, PrimitiveStep - -# Creating pipeline -pipeline_description = Pipeline() -pipeline_description.add_input(name='inputs') - -# Step 0: dataset_to_dataframe -step_0 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_transformation.dataset_to_dataframe.Common')) -step_0.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='inputs.0') -step_0.add_output('produce') -pipeline_description.add_step(step_0) - -# Step 1: Simple profiler primitive -step_1 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.schema_discovery.profiler.Common')) -step_1.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.0.produce') -step_1.add_output('produce') -pipeline_description.add_step(step_1) - -# Step 2: column_parser -step_2 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_transformation.column_parser.Common')) -step_2.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.1.produce') -step_2.add_output('produce') -pipeline_description.add_step(step_2) - -# Step 3: Extract text column explicitly -step_3 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_transformation.extract_columns.Common')) -step_3.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.2.produce') -step_3.add_hyperparameter(name='columns', argument_type=ArgumentType.VALUE, data = [25]) -step_3.add_output('produce') -pipeline_description.add_step(step_3) - -# Step 4: Extract target column explicitly -step_4 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_transformation.extract_columns.Common')) -step_4.add_argument(name='inputs', 
argument_type=ArgumentType.CONTAINER, data_reference='steps.2.produce') -step_4.add_hyperparameter(name='columns', argument_type=ArgumentType.VALUE, data = [6]) -step_4.add_output('produce') -pipeline_description.add_step(step_4) - -# Step 5: encode text column -step_5 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_transformation.encoder.DistilTextEncoder')) -step_5.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.3.produce') -step_5.add_argument(name='outputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.4.produce') -step_5.add_hyperparameter(name='encoder_type', argument_type=ArgumentType.VALUE, data = 'tfidf') -step_5.add_output('produce') -pipeline_description.add_step(step_5) - -# Step 6: classifier -step_6 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.learner.random_forest.DistilEnsembleForest')) -step_6.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.5.produce') -step_6.add_argument(name='outputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.4.produce') -step_6.add_output('produce') -pipeline_description.add_step(step_6) - -# Step 7: construct output -step_7 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_transformation.construct_predictions.Common')) -step_7.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.6.produce') -step_7.add_argument(name='reference', argument_type=ArgumentType.CONTAINER, data_reference='steps.0.produce') -step_7.add_output('produce') -pipeline_description.add_step(step_7) - -# Final Output -pipeline_description.add_output(name='output predictions', data_reference='steps.7.produce') - -# Output json pipeline -blob = pipeline_description.to_json() -filename = blob[8:44] + '.json' -with open(filename, 'w') as outfile: - outfile.write(blob) - diff --git a/common-primitives/pipelines/data_transformation.extract_columns_by_semantic_types.DataFrameCommon/b523335c-0c47-4d02-a582-f69609cde1e8.json b/common-primitives/pipelines/data_transformation.extract_columns_by_semantic_types.DataFrameCommon/b523335c-0c47-4d02-a582-f69609cde1e8.json deleted file mode 120000 index 51266fd..0000000 --- a/common-primitives/pipelines/data_transformation.extract_columns_by_semantic_types.DataFrameCommon/b523335c-0c47-4d02-a582-f69609cde1e8.json +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.extract_columns_by_structural_types.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json \ No newline at end of file diff --git a/common-primitives/pipelines/data_transformation.extract_columns_by_semantic_types.DataFrameCommon/d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde.json b/common-primitives/pipelines/data_transformation.extract_columns_by_semantic_types.DataFrameCommon/d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde.json deleted file mode 120000 index 080c8da..0000000 --- a/common-primitives/pipelines/data_transformation.extract_columns_by_semantic_types.DataFrameCommon/d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde.json +++ /dev/null @@ -1 +0,0 @@ -../classification.light_gbm.DataFrameCommon/d2473bbc-7839-4deb-9ba4-4ff4bc9b0bde.json \ No newline at end of file diff --git a/common-primitives/pipelines/data_transformation.extract_columns_by_structural_types.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json b/common-primitives/pipelines/data_transformation.extract_columns_by_structural_types.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json deleted file mode 100644 index ca4500d..0000000 --- 
a/common-primitives/pipelines/data_transformation.extract_columns_by_structural_types.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json +++ /dev/null @@ -1 +0,0 @@ -{"id": "b523335c-0c47-4d02-a582-f69609cde1e8", "schema": "https://metadata.datadrivendiscovery.org/schemas/v0/pipeline.json", "created": "2020-01-15T19:51:17.782254Z", "inputs": [{"name": "inputs"}], "outputs": [{"data": "steps.9.produce", "name": "output predictions"}], "steps": [{"type": "PRIMITIVE", "primitive": {"id": "4b42ce1e-9b98-4a25-b68e-fad13311eb65", "version": "0.3.0", "python_path": "d3m.primitives.data_transformation.dataset_to_dataframe.Common", "name": "Extract a DataFrame from a Dataset", "digest": "a1a0109be87a6ae578fd20e9d46c70c806059076c041b80b6314e7e41cf62d82"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "inputs.0"}}, "outputs": [{"id": "produce"}]}, {"type": "PRIMITIVE", "primitive": {"id": "e193afa1-b45e-4d29-918f-5bb1fa3b88a7", "version": "0.2.0", "python_path": "d3m.primitives.schema_discovery.profiler.Common", "name": "Determine missing semantic types for columns automatically", "digest": "a3d51cbc0bf18168114c1c8f12c497d691dbe30b71667f355f30c13a9a08ba32"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.0.produce"}}, "outputs": [{"id": "produce"}]}, {"type": "PRIMITIVE", "primitive": {"id": "79674d68-9b93-4359-b385-7b5f60645b06", "version": "0.1.0", "python_path": "d3m.primitives.data_transformation.extract_columns_by_structural_types.Common", "name": "Extracts columns by structural type", "digest": "7805010b9581bb96c035fefa5943209c69a1e234f10d9057d487af42c0fd4830"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.1.produce"}}, "outputs": [{"id": "produce"}]}, {"type": "PRIMITIVE", "primitive": {"id": "d510cb7a-1782-4f51-b44c-58f0236e47c7", "version": "0.6.0", "python_path": "d3m.primitives.data_transformation.column_parser.Common", "name": "Parses strings into their types", "digest": "b020e14e3d4f1e4266aa8a0680d83afcf2862300549c6f6c903742d7d171f879"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.2.produce"}}, "outputs": [{"id": "produce"}]}, {"type": "PRIMITIVE", "primitive": {"id": "f6315ca9-ca39-4e13-91ba-1964ee27281c", "version": "0.1.0", "python_path": "d3m.primitives.data_preprocessing.one_hot_encoder.PandasCommon", "name": "Pandas one hot encoder", "digest": "ed1217d4d7c017d8239b4f958c8e6ca0b3b67966ccb50cc5c578a9f14e465ec0"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.3.produce"}}, "outputs": [{"id": "produce"}], "hyperparams": {"use_columns": {"type": "VALUE", "data": [2, 5]}}}, {"type": "PRIMITIVE", "primitive": {"id": "3b09ba74-cc90-4f22-9e0a-0cf4f29a7e28", "version": "0.1.0", "python_path": "d3m.primitives.data_transformation.remove_columns.Common", "name": "Removes columns", "digest": "a725d149595186b85f1dea2bacbf4b853712b6a50eddb7c4c2295fabc3a04df1"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.4.produce"}}, "outputs": [{"id": "produce"}], "hyperparams": {"columns": {"type": "VALUE", "data": [25]}}}, {"type": "PRIMITIVE", "primitive": {"id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1", "version": "0.3.0", "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common", "name": "Extracts columns by semantic type", "digest": "505df38f9be4964ff19683ab3e185f19333fb35c26121c12a1c55bddd9d38f72"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.5.produce"}}, "outputs": [{"id": "produce"}], "hyperparams": {"semantic_types": {"type": "VALUE", "data": 
["https://metadata.datadrivendiscovery.org/types/Attribute"]}}}, {"type": "PRIMITIVE", "primitive": {"id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1", "version": "0.3.0", "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common", "name": "Extracts columns by semantic type", "digest": "505df38f9be4964ff19683ab3e185f19333fb35c26121c12a1c55bddd9d38f72"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.5.produce"}}, "outputs": [{"id": "produce"}], "hyperparams": {"semantic_types": {"type": "VALUE", "data": ["https://metadata.datadrivendiscovery.org/types/Target"]}}}, {"type": "PRIMITIVE", "primitive": {"id": "37c2b19d-bdab-4a30-ba08-6be49edcc6af", "version": "0.4.0", "python_path": "d3m.primitives.classification.random_forest.Common", "name": "Random forest classifier", "digest": "f5f702fc561775a6064c64c008a519f605eb00ca80f59a5d5e39b1340c7c015e"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.6.produce"}, "outputs": {"type": "CONTAINER", "data": "steps.7.produce"}}, "outputs": [{"id": "produce"}]}, {"type": "PRIMITIVE", "primitive": {"id": "8d38b340-f83f-4877-baaa-162f8e551736", "version": "0.3.0", "python_path": "d3m.primitives.data_transformation.construct_predictions.Common", "name": "Construct pipeline predictions output", "digest": "674a644333a3a481769591341591461b06de566fef7439010284739194e18af8"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.8.produce"}, "reference": {"type": "CONTAINER", "data": "steps.0.produce"}}, "outputs": [{"id": "produce"}]}], "digest": "7929f79fa8e2aaddcbe66d0f592525081280549e0713198e583728ff88b0f895"} \ No newline at end of file diff --git a/common-primitives/pipelines/data_transformation.extract_columns_by_structural_types.Common/pipeline.py b/common-primitives/pipelines/data_transformation.extract_columns_by_structural_types.Common/pipeline.py deleted file mode 100644 index ae876cd..0000000 --- a/common-primitives/pipelines/data_transformation.extract_columns_by_structural_types.Common/pipeline.py +++ /dev/null @@ -1,83 +0,0 @@ -from d3m import index -from d3m.metadata.base import ArgumentType, Context -from d3m.metadata.pipeline import Pipeline, PrimitiveStep - -# Creating pipeline -pipeline_description = Pipeline() -pipeline_description.add_input(name='inputs') - -# Step 0: dataset_to_dataframe -step_0 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_transformation.dataset_to_dataframe.Common')) -step_0.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='inputs.0') -step_0.add_output('produce') -pipeline_description.add_step(step_0) - -# Step 1: Simple profiler primitive -step_1 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.schema_discovery.profiler.Common')) -step_1.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.0.produce') -step_1.add_output('produce') -pipeline_description.add_step(step_1) - -# Step 2: Extract columns by structural type -step_2 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_transformation.extract_columns_by_structural_types.Common')) -step_2.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.1.produce') -step_2.add_output('produce') -pipeline_description.add_step(step_2) - -# Step 3: column_parser -step_3 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_transformation.column_parser.Common')) -step_3.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, 
data_reference='steps.2.produce') -step_3.add_output('produce') -pipeline_description.add_step(step_3) - -# Step 4 one hot encode -step_4 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_preprocessing.one_hot_encoder.PandasCommon')) -step_4.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.3.produce') -step_4.add_hyperparameter(name='use_columns', argument_type=ArgumentType.VALUE, data = [2,5]) -step_4.add_output('produce') -pipeline_description.add_step(step_4) - -# Step 5 remove text -step_5 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_transformation.remove_columns.Common')) -step_5.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.4.produce') -step_5.add_hyperparameter(name='columns', argument_type=ArgumentType.VALUE, data = [25]) -step_5.add_output('produce') -pipeline_description.add_step(step_5) - -# Step 6 extract attributes -step_6 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common')) -step_6.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.5.produce') -step_6.add_hyperparameter(name="semantic_types", argument_type=ArgumentType.VALUE, data=["https://metadata.datadrivendiscovery.org/types/Attribute"],) -step_6.add_output('produce') -pipeline_description.add_step(step_6) - -# Step 7 extract target -step_7 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common')) -step_7.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.5.produce') -step_7.add_hyperparameter(name="semantic_types", argument_type=ArgumentType.VALUE, data=["https://metadata.datadrivendiscovery.org/types/Target"],) -step_7.add_output('produce') -pipeline_description.add_step(step_7) - -# Step 8: classifier -step_8 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.classification.random_forest.Common')) -step_8.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.6.produce') -step_8.add_argument(name='outputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.7.produce') -step_8.add_output('produce') -pipeline_description.add_step(step_8) - -# Step 9: construct output -step_9 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_transformation.construct_predictions.Common')) -step_9.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.8.produce') -step_9.add_argument(name='reference', argument_type=ArgumentType.CONTAINER, data_reference='steps.0.produce') -step_9.add_output('produce') -pipeline_description.add_step(step_9) - -# Final Output -pipeline_description.add_output(name='output predictions', data_reference='steps.9.produce') - -# Output json pipeline -blob = pipeline_description.to_json() -filename = blob[8:44] + '.json' -with open(filename, 'w') as outfile: - outfile.write(blob) - diff --git a/common-primitives/pipelines/data_transformation.grouping_field_compose.Common/a8c40699-c48d-4f12-aa18-639c5fb6baae.json b/common-primitives/pipelines/data_transformation.grouping_field_compose.Common/a8c40699-c48d-4f12-aa18-639c5fb6baae.json deleted file mode 100644 index dbaf998..0000000 --- a/common-primitives/pipelines/data_transformation.grouping_field_compose.Common/a8c40699-c48d-4f12-aa18-639c5fb6baae.json +++ /dev/null @@ -1 +0,0 @@ -{"id": "a8c40699-c48d-4f12-aa18-639c5fb6baae", 
"schema": "https://metadata.datadrivendiscovery.org/schemas/v0/pipeline.json", "created": "2020-01-15T19:35:58.976691Z", "inputs": [{"name": "inputs"}], "outputs": [{"data": "steps.4.produce", "name": "output predictions"}], "steps": [{"type": "PRIMITIVE", "primitive": {"id": "4b42ce1e-9b98-4a25-b68e-fad13311eb65", "version": "0.3.0", "python_path": "d3m.primitives.data_transformation.dataset_to_dataframe.Common", "name": "Extract a DataFrame from a Dataset", "digest": "a1a0109be87a6ae578fd20e9d46c70c806059076c041b80b6314e7e41cf62d82"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "inputs.0"}}, "outputs": [{"id": "produce"}]}, {"type": "PRIMITIVE", "primitive": {"id": "e193afa1-b45e-4d29-918f-5bb1fa3b88a7", "version": "0.2.0", "python_path": "d3m.primitives.schema_discovery.profiler.Common", "name": "Determine missing semantic types for columns automatically", "digest": "a3d51cbc0bf18168114c1c8f12c497d691dbe30b71667f355f30c13a9a08ba32"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.0.produce"}}, "outputs": [{"id": "produce"}]}, {"type": "PRIMITIVE", "primitive": {"id": "d510cb7a-1782-4f51-b44c-58f0236e47c7", "version": "0.6.0", "python_path": "d3m.primitives.data_transformation.column_parser.Common", "name": "Parses strings into their types", "digest": "b020e14e3d4f1e4266aa8a0680d83afcf2862300549c6f6c903742d7d171f879"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.1.produce"}}, "outputs": [{"id": "produce"}], "hyperparams": {"parse_semantic_types": {"type": "VALUE", "data": ["http://schema.org/Boolean", "http://schema.org/Integer", "http://schema.org/Float", "https://metadata.datadrivendiscovery.org/types/FloatVector", "http://schema.org/DateTime"]}}}, {"type": "PRIMITIVE", "primitive": {"id": "59db88b9-dd81-4e50-8f43-8f2af959560b", "version": "0.1.0", "python_path": "d3m.primitives.data_transformation.grouping_field_compose.Common", "name": "Grouping Field Compose", "digest": "e93815bfdb1c82ce0e2fa61f092d6ee9bcf39367a27072accbb9f0dd9189fb03"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.2.produce"}}, "outputs": [{"id": "produce"}]}, {"type": "PRIMITIVE", "primitive": {"id": "76b5a479-c209-4d94-92b5-7eba7a4d4499", "version": "1.0.2", "python_path": "d3m.primitives.time_series_forecasting.vector_autoregression.VAR", "name": "VAR", "digest": "7e22a1e7fe228114a5788f16a8d3c7709ed3a98a90e9cc82e3b80ab5f232d352"}, "arguments": {"inputs": {"type": "CONTAINER", "data": "steps.3.produce"}, "outputs": {"type": "CONTAINER", "data": "steps.3.produce"}}, "outputs": [{"id": "produce"}]}], "digest": "da2c7d2605256f263ca4725fe7385be5e027a3ddadc8dbf7523ff98bcd016005"} \ No newline at end of file diff --git a/common-primitives/pipelines/data_transformation.grouping_field_compose.Common/pipeline.py b/common-primitives/pipelines/data_transformation.grouping_field_compose.Common/pipeline.py deleted file mode 100644 index 9a9ebb1..0000000 --- a/common-primitives/pipelines/data_transformation.grouping_field_compose.Common/pipeline.py +++ /dev/null @@ -1,100 +0,0 @@ -from d3m import index -from d3m.metadata.base import ArgumentType -from d3m.metadata.pipeline import Pipeline, PrimitiveStep - -# Creating pipeline -pipeline_description = Pipeline() -pipeline_description.add_input(name="inputs") - -# Step 0: DS to DF on input DS -step_0 = PrimitiveStep( - primitive=index.get_primitive( - "d3m.primitives.data_transformation.dataset_to_dataframe.Common" - ) -) -step_0.add_argument( - name="inputs", argument_type=ArgumentType.CONTAINER, data_reference="inputs.0" 
-) -step_0.add_output("produce") -pipeline_description.add_step(step_0) - -# Step 1: Simple Profiler Column Role Annotation -step_1 = PrimitiveStep( - primitive=index.get_primitive("d3m.primitives.schema_discovery.profiler.Common") -) -step_1.add_argument( - name="inputs", - argument_type=ArgumentType.CONTAINER, - data_reference="steps.0.produce", -) -step_1.add_output("produce") -pipeline_description.add_step(step_1) - -# Step 2: column parser on input DF -step_2 = PrimitiveStep( - primitive=index.get_primitive( - "d3m.primitives.data_transformation.column_parser.Common" - ) -) -step_2.add_argument( - name="inputs", - argument_type=ArgumentType.CONTAINER, - data_reference="steps.1.produce", -) -step_2.add_output("produce") -step_2.add_hyperparameter( - name="parse_semantic_types", - argument_type=ArgumentType.VALUE, - data=[ - "http://schema.org/Boolean", - "http://schema.org/Integer", - "http://schema.org/Float", - "https://metadata.datadrivendiscovery.org/types/FloatVector", - "http://schema.org/DateTime", - ], -) -pipeline_description.add_step(step_2) - -# Step 3: Grouping Field Compose -step_3 = PrimitiveStep( - primitive=index.get_primitive( - "d3m.primitives.data_transformation.grouping_field_compose.Common" - ) -) -step_3.add_argument( - name="inputs", - argument_type=ArgumentType.CONTAINER, - data_reference="steps.2.produce", -) -step_3.add_output("produce") -pipeline_description.add_step(step_3) - -# Step 4: forecasting primitive -step_4 = PrimitiveStep( - primitive=index.get_primitive( - "d3m.primitives.time_series_forecasting.vector_autoregression.VAR" - ) -) -step_4.add_argument( - name="inputs", - argument_type=ArgumentType.CONTAINER, - data_reference="steps.3.produce", -) -step_4.add_argument( - name="outputs", - argument_type=ArgumentType.CONTAINER, - data_reference="steps.3.produce", -) -step_4.add_output("produce") -pipeline_description.add_step(step_4) - -# Final Output -pipeline_description.add_output( - name="output predictions", data_reference="steps.4.produce" -) - -# Output json pipeline -blob = pipeline_description.to_json() -filename = blob[8:44] + ".json" -with open(filename, "w") as outfile: - outfile.write(blob) diff --git a/common-primitives/pipelines/data_transformation.horizontal_concat.DataFrameConcat/2b307634-f01e-412e-8d95-7e54afd4731f.json b/common-primitives/pipelines/data_transformation.horizontal_concat.DataFrameConcat/2b307634-f01e-412e-8d95-7e54afd4731f.json deleted file mode 120000 index 146d403..0000000 --- a/common-primitives/pipelines/data_transformation.horizontal_concat.DataFrameConcat/2b307634-f01e-412e-8d95-7e54afd4731f.json +++ /dev/null @@ -1 +0,0 @@ -../data_preprocessing.one_hot_encoder.MakerCommon/2b307634-f01e-412e-8d95-7e54afd4731f.json \ No newline at end of file diff --git a/common-primitives/pipelines/data_transformation.remove_columns.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json b/common-primitives/pipelines/data_transformation.remove_columns.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json deleted file mode 120000 index 51266fd..0000000 --- a/common-primitives/pipelines/data_transformation.remove_columns.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.extract_columns_by_structural_types.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json \ No newline at end of file diff --git a/common-primitives/pipelines/data_transformation.rename_duplicate_name.DataFrameCommon/11ee9290-992d-4e48-97ed-1a6e4c15f92f.json 
b/common-primitives/pipelines/data_transformation.rename_duplicate_name.DataFrameCommon/11ee9290-992d-4e48-97ed-1a6e4c15f92f.json deleted file mode 100644 index 8ea69cd..0000000 --- a/common-primitives/pipelines/data_transformation.rename_duplicate_name.DataFrameCommon/11ee9290-992d-4e48-97ed-1a6e4c15f92f.json +++ /dev/null @@ -1,272 +0,0 @@ -{ - "context": "TESTING", - "created": "2019-02-12T02:01:52.663008Z", - "id": "11ee9290-992d-4e48-97ed-1a6e4c15f92f", - "inputs": [ - { - "name": "inputs" - } - ], - "outputs": [ - { - "data": "steps.8.produce", - "name": "output predictions" - } - ], - "schema": "https://metadata.datadrivendiscovery.org/schemas/v0/pipeline.json", - "steps": [ - { - "arguments": { - "inputs": { - "data": "inputs.0", - "type": "CONTAINER" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "4b42ce1e-9b98-4a25-b68e-fad13311eb65", - "name": "Extract a DataFrame from a Dataset", - "python_path": "d3m.primitives.data_transformation.dataset_to_dataframe.Common", - "version": "0.3.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.0.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "parse_semantic_types": { - "data": [ - "http://schema.org/Boolean", - "http://schema.org/Integer", - "http://schema.org/Float", - "https://metadata.datadrivendiscovery.org/types/FloatVector", - "http://schema.org/DateTime" - ], - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "d510cb7a-1782-4f51-b44c-58f0236e47c7", - "name": "Parses strings into their types", - "python_path": "d3m.primitives.data_transformation.column_parser.Common", - "version": "0.6.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.1.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "separator": { - "data": "----", - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "7b067a78-4ad4-411d-9cf9-87bcee38ac73", - "name": "Rename all the duplicated name column in DataFrame", - "python_path": "d3m.primitives.data_transformation.rename_duplicate_name.DataFrameCommon", - "version": "0.2.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.2.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "semantic_types": { - "data": [ - "https://metadata.datadrivendiscovery.org/types/CategoricalData" - ], - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1", - "name": "Extracts columns by semantic type", - "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common", - "version": "0.3.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.2.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "exclude_columns": { - "data": [ - 0 - ], - "type": "VALUE" - }, - "semantic_types": { - "data": [ - "http://schema.org/Integer", - "http://schema.org/Float" - ], - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1", - "name": "Extracts columns by semantic type", - "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common", - "version": "0.3.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.0.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "semantic_types": { - "data": [ - 
"https://metadata.datadrivendiscovery.org/types/TrueTarget" - ], - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1", - "name": "Extracts columns by semantic type", - "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common", - "version": "0.3.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.4.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "return_result": { - "data": "replace", - "type": "VALUE" - }, - "use_semantic_types": { - "data": true, - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "d016df89-de62-3c53-87ed-c06bb6a23cde", - "name": "sklearn.impute.SimpleImputer", - "python_path": "d3m.primitives.data_cleaning.imputer.SKlearn", - "version": "2019.6.7" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.6.produce", - "type": "CONTAINER" - }, - "outputs": { - "data": "steps.5.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "return_result": { - "data": "replace", - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "1dd82833-5692-39cb-84fb-2455683075f3", - "name": "sklearn.ensemble.forest.RandomForestClassifier", - "python_path": "d3m.primitives.classification.random_forest.SKlearn", - "version": "2019.6.7" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.7.produce", - "type": "CONTAINER" - }, - "reference": { - "data": "steps.1.produce", - "type": "CONTAINER" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "8d38b340-f83f-4877-baaa-162f8e551736", - "name": "Construct pipeline predictions output", - "python_path": "d3m.primitives.data_transformation.construct_predictions.Common", - "version": "0.3.0" - }, - "type": "PRIMITIVE" - } - ] -} diff --git a/common-primitives/pipelines/evaluation.kfold_timeseries_split.Common/k-fold-timeseries-split.yml b/common-primitives/pipelines/evaluation.kfold_timeseries_split.Common/k-fold-timeseries-split.yml deleted file mode 100644 index 88e99d6..0000000 --- a/common-primitives/pipelines/evaluation.kfold_timeseries_split.Common/k-fold-timeseries-split.yml +++ /dev/null @@ -1,83 +0,0 @@ -id: 5bed1f23-ac17-4b52-9d06-a5b77a6aea51 -schema: https://metadata.datadrivendiscovery.org/schemas/v0/pipeline.json -source: - name: Jeffrey Gleason -created: "2019-04-08T16:18:27.250294Z" -context: TESTING -name: K-fold split of timeseries datasets -description: | - K-fold split of timeseries datasets for cross-validation. -inputs: - - name: folds - - name: full dataset -outputs: - - name: train datasets - data: steps.0.produce - - name: test datasets - data: steps.2.produce - - name: score datasets - data: steps.1.produce -steps: - # Step 0. - - type: PRIMITIVE - primitive: - id: 002f9ad1-46e3-40f4-89ed-eeffbb3a102b - version: 0.1.0 - python_path: d3m.primitives.evaluation.kfold_time_series_split.Common - name: K-fold cross-validation timeseries dataset splits - arguments: - inputs: - type: CONTAINER - data: inputs.0 - dataset: - type: CONTAINER - data: inputs.1 - outputs: - - id: produce - - id: produce_score_data - # Step 1. We redact privileged attributes for both score and test splits. 
- - type: PRIMITIVE - primitive: - id: 744c4090-e2f6-489e-8efc-8b1e051bfad6 - version: 0.2.0 - python_path: d3m.primitives.evaluation.redact_columns.Common - name: Redact columns for evaluation - arguments: - inputs: - type: CONTAINER - data: steps.0.produce_score_data - outputs: - - id: produce - hyperparams: - semantic_types: - type: VALUE - data: - - https://metadata.datadrivendiscovery.org/types/PrivilegedData - add_semantic_types: - type: VALUE - data: - - https://metadata.datadrivendiscovery.org/types/RedactedPrivilegedData - - https://metadata.datadrivendiscovery.org/types/MissingData - # Step 2. We further redact targets in test split. - - type: PRIMITIVE - primitive: - id: 744c4090-e2f6-489e-8efc-8b1e051bfad6 - version: 0.2.0 - python_path: d3m.primitives.evaluation.redact_columns.Common - name: Redact columns for evaluation - arguments: - inputs: - type: CONTAINER - data: steps.1.produce - outputs: - - id: produce - hyperparams: - semantic_types: - type: VALUE - data: - - https://metadata.datadrivendiscovery.org/types/TrueTarget - add_semantic_types: - type: VALUE - data: - - https://metadata.datadrivendiscovery.org/types/RedactedTarget - - https://metadata.datadrivendiscovery.org/types/MissingData diff --git a/common-primitives/pipelines/operator.dataset_map.DataFrameCommon/k-fold-timeseries-split-raw.yml b/common-primitives/pipelines/operator.dataset_map.DataFrameCommon/k-fold-timeseries-split-raw.yml deleted file mode 100644 index ea0047c..0000000 --- a/common-primitives/pipelines/operator.dataset_map.DataFrameCommon/k-fold-timeseries-split-raw.yml +++ /dev/null @@ -1,108 +0,0 @@ -# todo change name -id: 5bed1f23-ac17-4b52-9d06-a5b77a6aea51 -schema: https://metadata.datadrivendiscovery.org/schemas/v0/pipeline.json -source: - name: Jeffrey Gleason -created: "2019-12-19T16:29:34.702933Z" -context: TESTING -name: K-fold split of timeseries datasets -description: | - K-fold split of timeseries datasets for cross-validation. -inputs: - - name: folds - - name: full dataset -outputs: - - name: train datasets - data: steps.2.produce - - name: test datasets - data: steps.4.produce - - name: score datasets - data: steps.3.produce -steps: - # Step 0. Simon Data Typing primitive to infer DateTime column - - type: PRIMITIVE - primitive: - id: d2fa8df2-6517-3c26-bafc-87b701c4043a - version: 1.2.2 - python_path: d3m.primitives.data_cleaning.column_type_profiler.Simon - name: simon - # Step 1. Mapped Simon Data Typing primitive to infer DateTime column - - type: PRIMITIVE - primitive: - id: 5bef5738-1638-48d6-9935-72445f0eecdc - version: 0.1.0 - python_path: d3m.primitives.operator.dataset_map.DataFrameCommon - name: Map DataFrame resources to new resources using provided primitive - arguments: - inputs: - type: CONTAINER - data: inputs.1 - outputs: - - id: produce - hyperparams: - primitive: - type: PRIMITIVE - data: 0 - # Step 2. K-fold cross-validation timeseries dataset splits - - type: PRIMITIVE - primitive: - id: 002f9ad1-46e3-40f4-89ed-eeffbb3a102b - version: 0.1.0 - python_path: d3m.primitives.evaluation.kfold_time_series_split.Common - name: K-fold cross-validation timeseries dataset splits - arguments: - inputs: - type: CONTAINER - data: inputs.0 - dataset: - type: CONTAINER - data: steps.1.produce - outputs: - - id: produce - - id: produce_score_data - # Step 3. We redact privileged attributes for both score and test splits. 
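# Steps 3 and 4 mirror the redaction steps of the plain split pipeline above;
# the difference here is that the dataset first passes through the mapped
# Simon primitive (Steps 0-1), so the DateTime column is typed before splitting.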
- - type: PRIMITIVE - primitive: - id: 744c4090-e2f6-489e-8efc-8b1e051bfad6 - version: 0.2.0 - python_path: d3m.primitives.evaluation.redact_columns.Common - name: Redact columns for evaluation - arguments: - inputs: - type: CONTAINER - data: steps.2.produce_score_data - outputs: - - id: produce - hyperparams: - semantic_types: - type: VALUE - data: - - https://metadata.datadrivendiscovery.org/types/PrivilegedData - add_semantic_types: - type: VALUE - data: - - https://metadata.datadrivendiscovery.org/types/RedactedPrivilegedData - - https://metadata.datadrivendiscovery.org/types/MissingData - # Step 4. We further redact targets in test split. - - type: PRIMITIVE - primitive: - id: 744c4090-e2f6-489e-8efc-8b1e051bfad6 - version: 0.2.0 - python_path: d3m.primitives.evaluation.redact_columns.Common - name: Redact columns for evaluation - arguments: - inputs: - type: CONTAINER - data: steps.3.produce - outputs: - - id: produce - hyperparams: - semantic_types: - type: VALUE - data: - - https://metadata.datadrivendiscovery.org/types/TrueTarget - add_semantic_types: - type: VALUE - data: - - https://metadata.datadrivendiscovery.org/types/RedactedTarget - - https://metadata.datadrivendiscovery.org/types/MissingData diff --git a/common-primitives/pipelines/regression.xgboost_gbtree.DataFrameCommon/0f636602-6299-411b-9873-4b974cd393ba.json b/common-primitives/pipelines/regression.xgboost_gbtree.DataFrameCommon/0f636602-6299-411b-9873-4b974cd393ba.json deleted file mode 100644 index 1ae892b..0000000 --- a/common-primitives/pipelines/regression.xgboost_gbtree.DataFrameCommon/0f636602-6299-411b-9873-4b974cd393ba.json +++ /dev/null @@ -1,247 +0,0 @@ - -{ - "context": "TESTING", - "created": "2019-02-12T01:35:59.402796Z", - "id": "0f636602-6299-411b-9873-4b974cd393ba", - "inputs": [ - { - "name": "inputs" - } - ], - "outputs": [ - { - "data": "steps.7.produce", - "name": "output predictions" - } - ], - "schema": "https://metadata.datadrivendiscovery.org/schemas/v0/pipeline.json", - "steps": [ - { - "arguments": { - "inputs": { - "data": "inputs.0", - "type": "CONTAINER" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "4b42ce1e-9b98-4a25-b68e-fad13311eb65", - "name": "Extract a DataFrame from a Dataset", - "python_path": "d3m.primitives.data_transformation.dataset_to_dataframe.Common", - "version": "0.3.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.0.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "parse_semantic_types": { - "data": [ - "http://schema.org/Boolean", - "http://schema.org/Integer", - "http://schema.org/Float", - "https://metadata.datadrivendiscovery.org/types/FloatVector", - "http://schema.org/DateTime" - ], - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "d510cb7a-1782-4f51-b44c-58f0236e47c7", - "name": "Parses strings into their types", - "python_path": "d3m.primitives.data_transformation.column_parser.Common", - "version": "0.6.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.1.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "semantic_types": { - "data": [ - "https://metadata.datadrivendiscovery.org/types/CategoricalData" - ], - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1", - "name": "Extracts columns by semantic type", - "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common", 
- "version": "0.3.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.1.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "exclude_columns": { - "data": [ - 0 - ], - "type": "VALUE" - }, - "semantic_types": { - "data": [ - "http://schema.org/Integer", - "http://schema.org/Float" - ], - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1", - "name": "Extracts columns by semantic type", - "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common", - "version": "0.3.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.0.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "semantic_types": { - "data": [ - "https://metadata.datadrivendiscovery.org/types/TrueTarget" - ], - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "4503a4c6-42f7-45a1-a1d4-ed69699cf5e1", - "name": "Extracts columns by semantic type", - "python_path": "d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common", - "version": "0.3.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.3.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "return_result": { - "data": "replace", - "type": "VALUE" - }, - "use_semantic_types": { - "data": true, - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "d016df89-de62-3c53-87ed-c06bb6a23cde", - "name": "sklearn.impute.SimpleImputer", - "python_path": "d3m.primitives.data_cleaning.imputer.SKlearn", - "version": "2019.6.7" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.5.produce", - "type": "CONTAINER" - }, - "outputs": { - "data": "steps.4.produce", - "type": "CONTAINER" - } - }, - "hyperparams": { - "return_result": { - "data": "replace", - "type": "VALUE" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "cdbb80e4-e9de-4caa-a710-16b5d727b959", - "name": "XGBoost GBTree regressor", - "python_path": "d3m.primitives.regression.xgboost_gbtree.Common", - "version": "0.1.0" - }, - "type": "PRIMITIVE" - }, - { - "arguments": { - "inputs": { - "data": "steps.6.produce", - "type": "CONTAINER" - }, - "reference": { - "data": "steps.1.produce", - "type": "CONTAINER" - } - }, - "outputs": [ - { - "id": "produce" - } - ], - "primitive": { - "id": "8d38b340-f83f-4877-baaa-162f8e551736", - "name": "Construct pipeline predictions output", - "python_path": "d3m.primitives.data_transformation.construct_predictions.Common", - "version": "0.3.0" - }, - "type": "PRIMITIVE" - } - ] -} diff --git a/common-primitives/pipelines/schema_discovery.profiler.Common/4ec215d1-6484-4502-a6dd-f659943ccb94.json b/common-primitives/pipelines/schema_discovery.profiler.Common/4ec215d1-6484-4502-a6dd-f659943ccb94.json deleted file mode 120000 index 0deae2e..0000000 --- a/common-primitives/pipelines/schema_discovery.profiler.Common/4ec215d1-6484-4502-a6dd-f659943ccb94.json +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.extract_columns.Common/4ec215d1-6484-4502-a6dd-f659943ccb94.json \ No newline at end of file diff --git a/common-primitives/pipelines/schema_discovery.profiler.Common/a8c40699-c48d-4f12-aa18-639c5fb6baae.json b/common-primitives/pipelines/schema_discovery.profiler.Common/a8c40699-c48d-4f12-aa18-639c5fb6baae.json deleted file mode 120000 index b1225d9..0000000 --- 
a/common-primitives/pipelines/schema_discovery.profiler.Common/a8c40699-c48d-4f12-aa18-639c5fb6baae.json +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.grouping_field_compose.Common/a8c40699-c48d-4f12-aa18-639c5fb6baae.json \ No newline at end of file diff --git a/common-primitives/pipelines/schema_discovery.profiler.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json b/common-primitives/pipelines/schema_discovery.profiler.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json deleted file mode 120000 index 51266fd..0000000 --- a/common-primitives/pipelines/schema_discovery.profiler.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json +++ /dev/null @@ -1 +0,0 @@ -../data_transformation.extract_columns_by_structural_types.Common/b523335c-0c47-4d02-a582-f69609cde1e8.json \ No newline at end of file diff --git a/common-primitives/run_pipelines.sh b/common-primitives/run_pipelines.sh deleted file mode 100755 index 437c24b..0000000 --- a/common-primitives/run_pipelines.sh +++ /dev/null @@ -1,44 +0,0 @@ -#!/bin/bash - -mkdir -p results - -overall_result="0" - -while IFS= read -r pipeline_run_file; do - pipeline_run_name="$(dirname "$pipeline_run_file")/$(basename -s .yml.gz "$(basename -s .yaml.gz "$pipeline_run_file")")" - primitive_name="$(basename "$(dirname "$pipeline_run_file")")" - - if [[ -L "$pipeline_run_file" ]]; then - echo ">>> Skipping '$pipeline_run_file'." - continue - else - mkdir -p "results/$pipeline_run_name" - fi - - pipelines_path="pipelines/$primitive_name" - - if [[ ! -d "$pipelines_path" ]]; then - echo ">>> ERROR: Could not find pipelines for '$pipeline_run_file'." - overall_result="1" - continue - fi - - echo ">>> Running '$pipeline_run_file'." - python3 -m d3m --pipelines-path "$pipelines_path" \ - runtime \ - --datasets /data/datasets --volumes /data/static_files \ - fit-score --input-run "$pipeline_run_file" \ - --output "results/$pipeline_run_name/predictions.csv" \ - --scores "results/$pipeline_run_name/scores.csv" \ - --output-run "results/$pipeline_run_name/pipeline_runs.yaml" - result="$?" - - if [[ "$result" -eq 0 ]]; then - echo ">>> SUCCESS ($pipeline_run_file)" - else - echo ">>> ERROR ($pipeline_run_file)" - overall_result="1" - fi -done < <(find pipeline_runs -name '*.yml.gz' -or -name '*.yaml.gz') - -exit "$overall_result" diff --git a/common-primitives/run_tests.py b/common-primitives/run_tests.py deleted file mode 100755 index 16c264a..0000000 --- a/common-primitives/run_tests.py +++ /dev/null @@ -1,11 +0,0 @@ -#!/usr/bin/env python3 - -import sys -import unittest - -runner = unittest.TextTestRunner(verbosity=1) - -tests = unittest.TestLoader().discover('tests') - -if not runner.run(tests).wasSuccessful(): - sys.exit(1) diff --git a/common-primitives/setup.cfg b/common-primitives/setup.cfg deleted file mode 100644 index e218fc8..0000000 --- a/common-primitives/setup.cfg +++ /dev/null @@ -1,28 +0,0 @@ -[pycodestyle] -max-line-length = 200 - -[metadata] -description-file = README.md - -[mypy] -warn_redundant_casts = True -# TODO: Enable back once false positives are fixed. 
-# See: https://github.com/python/mypy/issues/4412 -#warn_unused_ignores = True -warn_unused_configs = True -disallow_untyped_defs = True - -# TODO: Remove once this is fixed: https://github.com/python/mypy/issues/4300 -[mypy-d3m.container.list] -ignore_errors = True - -# TODO: Remove once this is fixed: https://github.com/python/mypy/issues/4300 -[mypy-d3m.metadata.hyperparams] -ignore_errors = True - -# TODO: Remove once this is fixed: https://github.com/python/mypy/pull/4384#issuecomment-354033177 -[mypy-d3m.primitive_interfaces.distance] -ignore_errors = True - -[mypy-common_primitives.slacker.*] -ignore_errors = True diff --git a/common-primitives/setup.py b/common-primitives/setup.py deleted file mode 100644 index c8d1e21..0000000 --- a/common-primitives/setup.py +++ /dev/null @@ -1,65 +0,0 @@ -import os -import sys -from setuptools import setup, find_packages - -PACKAGE_NAME = 'common_primitives' -MINIMUM_PYTHON_VERSION = 3, 6 - - -def check_python_version(): - """Exit when the Python version is too low.""" - if sys.version_info < MINIMUM_PYTHON_VERSION: - sys.exit("Python {}.{}+ is required.".format(*MINIMUM_PYTHON_VERSION)) - - -def read_package_variable(key): - """Read the value of a variable from the package without importing.""" - module_path = os.path.join(PACKAGE_NAME, '__init__.py') - with open(module_path) as module: - for line in module: - parts = line.strip().split(' ') - if parts and parts[0] == key: - return parts[-1].strip("'") - raise KeyError("'{0}' not found in '{1}'".format(key, module_path)) - - -def read_readme(): - with open(os.path.join(os.path.dirname(__file__), 'README.md'), encoding='utf8') as file: - return file.read() - - -def read_entry_points(): - with open('entry_points.ini') as entry_points: - return entry_points.read() - - -check_python_version() -version = read_package_variable('__version__') - -setup( - name=PACKAGE_NAME, - version=version, - description='D3M common primitives', - author=read_package_variable('__author__'), - packages=find_packages(exclude=['contrib', 'docs', 'tests*']), - data_files=[('./', ['./entry_points.ini'])], - install_requires=[ - 'd3m', - 'pandas', - 'scikit-learn', - 'numpy', - 'lightgbm>=2.2.2,<=2.3.0', - 'opencv-python-headless<=4.1.1.26,>=4.1', - 'imageio>=2.3.0,<=2.6.0', - 'pillow==6.2.1', - 'xgboost>=0.81,<=0.90', - ], - entry_points=read_entry_points(), - url='https://gitlab.com/datadrivendiscovery/common-primitives', - long_description=read_readme(), - long_description_content_type='text/markdown', - license='Apache-2.0', - classifiers=[ - 'License :: OSI Approved :: Apache Software License', - ], -) diff --git a/common-primitives/sklearn-wrap/.gitignore b/common-primitives/sklearn-wrap/.gitignore deleted file mode 100644 index 36fa0f3..0000000 --- a/common-primitives/sklearn-wrap/.gitignore +++ /dev/null @@ -1,2 +0,0 @@ -.pyc -__pycache__ diff --git a/common-primitives/sklearn-wrap/requirements.txt b/common-primitives/sklearn-wrap/requirements.txt deleted file mode 100644 index fccd5e8..0000000 --- a/common-primitives/sklearn-wrap/requirements.txt +++ /dev/null @@ -1,31 +0,0 @@ -scikit-learn==0.22.0 -pytypes==1.0b5 -frozendict==1.2 -numpy>=1.15.4,<=1.18.1 -jsonschema==2.6.0 -requests>=2.19.1,<=2.22.0 -strict-rfc3339==0.7 -rfc3987==1.3.8 -webcolors>=1.8.1,<=1.10 -dateparser>=0.7.0,<=0.7.2 -python-dateutil==2.8.1 -pandas==0.25 -typing-inspect==0.5.0 -GitPython>=2.1.11,<=3.0.5 -jsonpath-ng==1.4.3 -custom-inherit>=2.2.0,<=2.2.2 -PyYAML>=5.1,<=5.3 -pycurl>=7.43.0.2,<=7.43.0.3 -pyarrow==0.15.1 -gputil>=1.3.0,<=1.4.0 
-pyrsistent>=0.14.11,<=0.15.7 -scipy>=1.2.1,<=1.4.1 -openml==0.10.1 -lightgbm>=2.2.2,<=2.3.0 -opencv-python-headless<=4.1.1.26,>=4.1 -imageio>=2.3.0,<=2.6.0 -pillow==6.2.1 -xgboost>=0.81,<=0.90 -Jinja2==2.9.4 -simplejson==3.12.0 -gitdb2==2.0.6 diff --git a/common-primitives/sklearn-wrap/setup.py b/common-primitives/sklearn-wrap/setup.py deleted file mode 100644 index 0090ec8..0000000 --- a/common-primitives/sklearn-wrap/setup.py +++ /dev/null @@ -1,106 +0,0 @@ -import os -from setuptools import setup, find_packages - -PACKAGE_NAME = 'sklearn_wrap' - - -def read_package_variable(key): - """Read the value of a variable from the package without importing.""" - module_path = os.path.join(PACKAGE_NAME, '__init__.py') - with open(module_path) as module: - for line in module: - parts = line.strip().split(' ') - if parts and parts[0] == key: - return parts[-1].strip("'") - assert False, "'{0}' not found in '{1}'".format(key, module_path) - - -setup( - name=PACKAGE_NAME, - version=read_package_variable('__version__'), - description='Primitives created using the Sklearn auto wrapper', - author=read_package_variable('__author__'), - packages=find_packages(exclude=['contrib', 'docs', 'tests*']), - install_requires=[ - 'd3m', - 'Jinja2==2.9.4', - 'simplejson==3.12.0', - 'scikit-learn==0.22.0', - ], - url='https://gitlab.datadrivendiscovery.org/jpl/sklearn-wrapping', - entry_points = { - 'd3m.primitives': [ - 'data_cleaning.string_imputer.SKlearn = sklearn_wrap.SKStringImputer:SKStringImputer', - 'classification.gradient_boosting.SKlearn = sklearn_wrap.SKGradientBoostingClassifier:SKGradientBoostingClassifier', - 'classification.quadratic_discriminant_analysis.SKlearn = sklearn_wrap.SKQuadraticDiscriminantAnalysis:SKQuadraticDiscriminantAnalysis', - 'classification.decision_tree.SKlearn = sklearn_wrap.SKDecisionTreeClassifier:SKDecisionTreeClassifier', - 'classification.sgd.SKlearn = sklearn_wrap.SKSGDClassifier:SKSGDClassifier', - 'classification.nearest_centroid.SKlearn = sklearn_wrap.SKNearestCentroid:SKNearestCentroid', - 'classification.mlp.SKlearn = sklearn_wrap.SKMLPClassifier:SKMLPClassifier', - 'classification.bagging.SKlearn = sklearn_wrap.SKBaggingClassifier:SKBaggingClassifier', - 'classification.linear_svc.SKlearn = sklearn_wrap.SKLinearSVC:SKLinearSVC', - 'classification.linear_discriminant_analysis.SKlearn = sklearn_wrap.SKLinearDiscriminantAnalysis:SKLinearDiscriminantAnalysis', - 'classification.passive_aggressive.SKlearn = sklearn_wrap.SKPassiveAggressiveClassifier:SKPassiveAggressiveClassifier', - 'classification.gaussian_naive_bayes.SKlearn = sklearn_wrap.SKGaussianNB:SKGaussianNB', - 'classification.ada_boost.SKlearn = sklearn_wrap.SKAdaBoostClassifier:SKAdaBoostClassifier', - 'classification.random_forest.SKlearn = sklearn_wrap.SKRandomForestClassifier:SKRandomForestClassifier', - 'classification.svc.SKlearn = sklearn_wrap.SKSVC:SKSVC', - 'classification.multinomial_naive_bayes.SKlearn = sklearn_wrap.SKMultinomialNB:SKMultinomialNB', - 'classification.dummy.SKlearn = sklearn_wrap.SKDummyClassifier:SKDummyClassifier', - 'classification.extra_trees.SKlearn = sklearn_wrap.SKExtraTreesClassifier:SKExtraTreesClassifier', - 'classification.logistic_regression.SKlearn = sklearn_wrap.SKLogisticRegression:SKLogisticRegression', - 'classification.bernoulli_naive_bayes.SKlearn = sklearn_wrap.SKBernoulliNB:SKBernoulliNB', - 'classification.k_neighbors.SKlearn = sklearn_wrap.SKKNeighborsClassifier:SKKNeighborsClassifier', - 'regression.decision_tree.SKlearn = 
sklearn_wrap.SKDecisionTreeRegressor:SKDecisionTreeRegressor', - 'regression.ada_boost.SKlearn = sklearn_wrap.SKAdaBoostRegressor:SKAdaBoostRegressor', - 'regression.k_neighbors.SKlearn = sklearn_wrap.SKKNeighborsRegressor:SKKNeighborsRegressor', - 'regression.linear.SKlearn = sklearn_wrap.SKLinearRegression:SKLinearRegression', - 'regression.bagging.SKlearn = sklearn_wrap.SKBaggingRegressor:SKBaggingRegressor', - 'regression.lasso_cv.SKlearn = sklearn_wrap.SKLassoCV:SKLassoCV', - 'regression.elastic_net.SKlearn = sklearn_wrap.SKElasticNet:SKElasticNet', - 'regression.ard.SKlearn = sklearn_wrap.SKARDRegression:SKARDRegression', - 'regression.svr.SKlearn = sklearn_wrap.SKSVR:SKSVR', - 'regression.ridge.SKlearn = sklearn_wrap.SKRidge:SKRidge', - 'regression.gaussian_process.SKlearn = sklearn_wrap.SKGaussianProcessRegressor:SKGaussianProcessRegressor', - 'regression.mlp.SKlearn = sklearn_wrap.SKMLPRegressor:SKMLPRegressor', - 'regression.dummy.SKlearn = sklearn_wrap.SKDummyRegressor:SKDummyRegressor', - 'regression.sgd.SKlearn = sklearn_wrap.SKSGDRegressor:SKSGDRegressor', - 'regression.lasso.SKlearn = sklearn_wrap.SKLasso:SKLasso', - 'regression.lars.SKlearn = sklearn_wrap.SKLars:SKLars', - 'regression.extra_trees.SKlearn = sklearn_wrap.SKExtraTreesRegressor:SKExtraTreesRegressor', - 'regression.linear_svr.SKlearn = sklearn_wrap.SKLinearSVR:SKLinearSVR', - 'regression.random_forest.SKlearn = sklearn_wrap.SKRandomForestRegressor:SKRandomForestRegressor', - 'regression.gradient_boosting.SKlearn = sklearn_wrap.SKGradientBoostingRegressor:SKGradientBoostingRegressor', - 'regression.passive_aggressive.SKlearn = sklearn_wrap.SKPassiveAggressiveRegressor:SKPassiveAggressiveRegressor', - 'regression.kernel_ridge.SKlearn = sklearn_wrap.SKKernelRidge:SKKernelRidge', - 'data_preprocessing.max_abs_scaler.SKlearn = sklearn_wrap.SKMaxAbsScaler:SKMaxAbsScaler', - 'data_preprocessing.normalizer.SKlearn = sklearn_wrap.SKNormalizer:SKNormalizer', - 'data_preprocessing.robust_scaler.SKlearn = sklearn_wrap.SKRobustScaler:SKRobustScaler', - 'data_preprocessing.tfidf_vectorizer.SKlearn = sklearn_wrap.SKTfidfVectorizer:SKTfidfVectorizer', - 'data_transformation.one_hot_encoder.SKlearn = sklearn_wrap.SKOneHotEncoder:SKOneHotEncoder', - 'data_preprocessing.truncated_svd.SKlearn = sklearn_wrap.SKTruncatedSVD:SKTruncatedSVD', - 'feature_selection.select_percentile.SKlearn = sklearn_wrap.SKSelectPercentile:SKSelectPercentile', - 'feature_extraction.pca.SKlearn = sklearn_wrap.SKPCA:SKPCA', - 'data_preprocessing.count_vectorizer.SKlearn = sklearn_wrap.SKCountVectorizer:SKCountVectorizer', - 'data_transformation.ordinal_encoder.SKlearn = sklearn_wrap.SKOrdinalEncoder:SKOrdinalEncoder', - 'data_preprocessing.binarizer.SKlearn = sklearn_wrap.SKBinarizer:SKBinarizer', - 'data_cleaning.missing_indicator.SKlearn = sklearn_wrap.SKMissingIndicator:SKMissingIndicator', - 'feature_selection.select_fwe.SKlearn = sklearn_wrap.SKSelectFwe:SKSelectFwe', - 'data_preprocessing.rbf_sampler.SKlearn = sklearn_wrap.SKRBFSampler:SKRBFSampler', - 'data_preprocessing.min_max_scaler.SKlearn = sklearn_wrap.SKMinMaxScaler:SKMinMaxScaler', - 'data_preprocessing.random_trees_embedding.SKlearn = sklearn_wrap.SKRandomTreesEmbedding:SKRandomTreesEmbedding', - 'data_transformation.gaussian_random_projection.SKlearn = sklearn_wrap.SKGaussianRandomProjection:SKGaussianRandomProjection', - 'feature_extraction.kernel_pca.SKlearn = sklearn_wrap.SKKernelPCA:SKKernelPCA', - 'data_preprocessing.polynomial_features.SKlearn = 
sklearn_wrap.SKPolynomialFeatures:SKPolynomialFeatures', - 'data_preprocessing.feature_agglomeration.SKlearn = sklearn_wrap.SKFeatureAgglomeration:SKFeatureAgglomeration', - 'data_cleaning.imputer.SKlearn = sklearn_wrap.SKImputer:SKImputer', - 'data_preprocessing.standard_scaler.SKlearn = sklearn_wrap.SKStandardScaler:SKStandardScaler', - 'data_transformation.fast_ica.SKlearn = sklearn_wrap.SKFastICA:SKFastICA', - 'data_preprocessing.quantile_transformer.SKlearn = sklearn_wrap.SKQuantileTransformer:SKQuantileTransformer', - 'data_transformation.sparse_random_projection.SKlearn = sklearn_wrap.SKSparseRandomProjection:SKSparseRandomProjection', - 'data_preprocessing.nystroem.SKlearn = sklearn_wrap.SKNystroem:SKNystroem', - 'feature_selection.variance_threshold.SKlearn = sklearn_wrap.SKVarianceThreshold:SKVarianceThreshold', - 'feature_selection.generic_univariate_select.SKlearn = sklearn_wrap.SKGenericUnivariateSelect:SKGenericUnivariateSelect', - ], - }, -) diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKARDRegression.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKARDRegression.py deleted file mode 100644 index 6d1b782..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKARDRegression.py +++ /dev/null @@ -1,470 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.linear_model.bayes import ARDRegression - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - coef_: Optional[ndarray] - alpha_: Optional[float] - lambda_: Optional[ndarray] - sigma_: Optional[ndarray] - scores_: Optional[Sequence[Any]] - intercept_: Optional[float] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - n_iter = hyperparams.Bounded[int]( - default=300, - lower=0, - upper=None, - description='Maximum number of iterations. Default is 300', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Bounded[float]( - default=0.001, - lower=0, - upper=None, - description='Stop the algorithm if w has converged. Default is 1.e-3.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - alpha_1 = hyperparams.Hyperparameter[float]( - default=1e-06, - description='Hyper-parameter : shape parameter for the Gamma distribution prior over the alpha parameter. 
Default is 1.e-6.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - alpha_2 = hyperparams.Hyperparameter[float]( - default=1e-06, - description='Hyper-parameter : inverse scale parameter (rate parameter) for the Gamma distribution prior over the alpha parameter. Default is 1.e-6.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - lambda_1 = hyperparams.Hyperparameter[float]( - default=1e-06, - description='Hyper-parameter : shape parameter for the Gamma distribution prior over the lambda parameter. Default is 1.e-6.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - lambda_2 = hyperparams.Hyperparameter[float]( - default=1e-06, - description='Hyper-parameter : inverse scale parameter (rate parameter) for the Gamma distribution prior over the lambda parameter. Default is 1.e-6.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - threshold_lambda = hyperparams.Hyperparameter[float]( - default=10000.0, - description='threshold for removing (pruning) weights with high precision from the computation. Default is 1.e+4.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - fit_intercept = hyperparams.UniformBool( - default=True, - description='whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (e.g. data is expected to be already centered). Default is True.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - normalize = hyperparams.UniformBool( - default=False, - description='If True, the regressors X will be normalized before regression. This parameter is ignored when `fit_intercept` is set to False. When the regressors are normalized, note that this makes the hyperparameters learnt more robust and almost independent of the number of samples. The same property is not valid for standardized data. However, if you wish to standardize, please use `preprocessing.StandardScaler` before calling `fit` on an estimator with `normalize=False`. copy_X : boolean, optional, default True. If True, X will be copied; else, it may be overwritten.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. 
Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKARDRegression(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn ARDRegression - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ARDRegression.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.BAYESIAN_LINEAR_REGRESSION, ], - "name": "sklearn.linear_model.bayes.ARDRegression", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.ard.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ARDRegression.html']}, - "version": "2019.11.13", - "id": "966dd2c4-d439-3ad6-b49f-17706595606c", - "hyperparams_to_tune": ['n_iter'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _copy_X: bool = True,
- _verbose: bool = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = ARDRegression( - n_iter=self.hyperparams['n_iter'], - tol=self.hyperparams['tol'], - alpha_1=self.hyperparams['alpha_1'], - alpha_2=self.hyperparams['alpha_2'], - lambda_1=self.hyperparams['lambda_1'], - lambda_2=self.hyperparams['lambda_2'], - threshold_lambda=self.hyperparams['threshold_lambda'], - fit_intercept=self.hyperparams['fit_intercept'], - normalize=self.hyperparams['normalize'], - copy_X=_copy_X, - verbose=_verbose - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) 
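-    # Typical call sequence for this primitive (an illustrative sketch only;
-    # "train_df" and "targets_df" are hypothetical D3M DataFrames with metadata,
-    # not names from this file):
-    #
-    #     primitive = SKARDRegression(hyperparams=Hyperparams.defaults())
-    #     primitive.set_training_data(inputs=train_df, outputs=targets_df)
-    #     primitive.fit()
-    #     predictions = primitive.produce(inputs=train_df).value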
- - - def get_params(self) -> Params: - if not self._fitted: - return Params( - coef_=None, - alpha_=None, - lambda_=None, - sigma_=None, - scores_=None, - intercept_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - coef_=getattr(self._clf, 'coef_', None), - alpha_=getattr(self._clf, 'alpha_', None), - lambda_=getattr(self._clf, 'lambda_', None), - sigma_=getattr(self._clf, 'sigma_', None), - scores_=getattr(self._clf, 'scores_', None), - intercept_=getattr(self._clf, 'intercept_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.coef_ = params['coef_'] - self._clf.alpha_ = params['alpha_'] - self._clf.lambda_ = params['lambda_'] - self._clf.sigma_ = params['sigma_'] - self._clf.scores_ = params['scores_'] - self._clf.intercept_ = params['intercept_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['coef_'] is not None: - self._fitted = True - if params['alpha_'] is not None: - self._fitted = True - if params['lambda_'] is not None: - self._fitted = True - if params['sigma_'] is not None: - self._fitted = True - if params['scores_'] is not None: - self._fitted = True - if params['intercept_'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: 
Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - 
semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKARDRegression.__doc__ = ARDRegression.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKAdaBoostClassifier.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKAdaBoostClassifier.py deleted file mode 100644 index e48b2b6..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKAdaBoostClassifier.py +++ /dev/null @@ -1,498 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.ensemble.weight_boosting import AdaBoostClassifier - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - estimators_: Optional[Sequence[sklearn.base.BaseEstimator]] - classes_: Optional[ndarray] - n_classes_: Optional[int] - estimator_weights_: Optional[ndarray] - estimator_errors_: Optional[ndarray] - base_estimator_: Optional[object] - estimator_params: Optional[tuple] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - base_estimator = hyperparams.Constant( - default=None, - description='The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper `classes_` and `n_classes_` attributes.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_estimators = hyperparams.Bounded[int]( - lower=1, - upper=None, - default=50, - description='The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - learning_rate = hyperparams.Uniform( - lower=0.01, - upper=2, - default=0.1, - description='Learning rate shrinks the contribution of each classifier by ``learning_rate``. There is a trade-off between ``learning_rate`` and ``n_estimators``.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - algorithm = hyperparams.Enumeration[str]( - values=['SAMME.R', 'SAMME'], - default='SAMME.R', - description='If \'SAMME.R\' then use the SAMME.R real boosting algorithm. 
``base_estimator`` must support calculation of class probabilities. If \'SAMME\' then use the SAMME discrete boosting algorithm. The SAMME.R algorithm typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKAdaBoostClassifier(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams], - ProbabilisticCompositionalityMixin[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn AdaBoostClassifier - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.ADABOOST, ], - "name": "sklearn.ensemble.weight_boosting.AdaBoostClassifier", - "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION, - "python_path": "d3m.primitives.classification.ada_boost.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html']}, - "version": "2019.11.13", - "id": "4210a6a6-14ab-4490-a7dc-460763e70e55", - "hyperparams_to_tune": ['learning_rate', 'n_estimators'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = AdaBoostClassifier( - base_estimator=self.hyperparams['base_estimator'], - n_estimators=self.hyperparams['n_estimators'], - learning_rate=self.hyperparams['learning_rate'], - algorithm=self.hyperparams['algorithm'], - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) >
0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - estimators_=None, - classes_=None, - n_classes_=None, - estimator_weights_=None, - estimator_errors_=None, - base_estimator_=None, - estimator_params=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - estimators_=getattr(self._clf, 'estimators_', None), - classes_=getattr(self._clf, 'classes_', None), - n_classes_=getattr(self._clf, 'n_classes_', None), - estimator_weights_=getattr(self._clf, 'estimator_weights_', None), - estimator_errors_=getattr(self._clf, 'estimator_errors_', None), - base_estimator_=getattr(self._clf, 'base_estimator_', None), - estimator_params=getattr(self._clf, 'estimator_params', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.estimators_ = params['estimators_'] - self._clf.classes_ = params['classes_'] - self._clf.n_classes_ = params['n_classes_'] - self._clf.estimator_weights_ = params['estimator_weights_'] - self._clf.estimator_errors_ = params['estimator_errors_'] - self._clf.base_estimator_ = params['base_estimator_'] - self._clf.estimator_params = params['estimator_params'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = 
params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['estimators_'] is not None: - self._fitted = True - if params['classes_'] is not None: - self._fitted = True - if params['n_classes_'] is not None: - self._fitted = True - if params['estimator_weights_'] is not None: - self._fitted = True - if params['estimator_errors_'] is not None: - self._fitted = True - if params['base_estimator_'] is not None: - self._fitted = True - if params['estimator_params'] is not None: - self._fitted = True - - - def log_likelihoods(self, *, - outputs: Outputs, - inputs: Inputs, - timeout: float = None, - iterations: int = None) -> CallResult[Sequence[float]]: - inputs = inputs.iloc[:, self._training_indices] # Get ndarray - outputs = outputs.iloc[:, self._target_column_indices] - - if len(inputs.columns) and len(outputs.columns): - - if outputs.shape[1] != self._clf.n_outputs_: - raise exceptions.InvalidArgumentValueError("\"outputs\" argument does not have the correct number of target columns.") - - log_proba = self._clf.predict_log_proba(inputs) - - # Making it always a list, even when only one target. - if self._clf.n_outputs_ == 1: - log_proba = [log_proba] - classes = [self._clf.classes_] - else: - classes = self._clf.classes_ - - samples_length = inputs.shape[0] - - log_likelihoods = [] - for k in range(self._clf.n_outputs_): - # We have to map each class to its internal (numerical) index used in the learner. - # This allows "outputs" to contain string classes. - outputs_column = outputs.iloc[:, k] - classes_map = pandas.Series(numpy.arange(len(classes[k])), index=classes[k]) - mapped_outputs_column = outputs_column.map(classes_map) - - # For each target column (column in "outputs"), for each sample (row) we pick the log - # likelihood for a given class. 
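-            # For example (illustrative values only): if classes[k] is
-            # ['a', 'b', 'c'] and outputs_column is ['a', 'c'], then
-            # mapped_outputs_column is [0, 2] and the indexing below selects
-            # log_proba[k][0, 0] and log_proba[k][1, 2].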
- log_likelihoods.append(log_proba[k][numpy.arange(samples_length), mapped_outputs_column]) - - results = d3m_dataframe(dict(enumerate(log_likelihoods)), generate_metadata=True) - results.columns = outputs.columns - - for k in range(self._clf.n_outputs_): - column_metadata = outputs.metadata.query_column(k) - if 'name' in column_metadata: - results.metadata = results.metadata.update_column(k, {'name': column_metadata['name']}) - - else: - results = d3m_dataframe(generate_metadata=True) - - return CallResult(results) - - - - def produce_feature_importances(self, *, timeout: float = None, iterations: int = None) -> CallResult[d3m_dataframe]: - output = d3m_dataframe(self._clf.feature_importances_.reshape((1, len(self._input_column_names)))) - output.columns = self._input_column_names - for i in range(len(self._input_column_names)): - output.metadata = output.metadata.update_column(i, {"name": self._input_column_names[i]}) - return CallResult(output) - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 
'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKAdaBoostClassifier.__doc__ = AdaBoostClassifier.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKAdaBoostRegressor.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKAdaBoostRegressor.py deleted file mode 100644 index bf06e54..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKAdaBoostRegressor.py +++ /dev/null @@ -1,437 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray 
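# Note on the `sklearn.ensemble.weight_boosting` import below: that private
# module path was deprecated in scikit-learn 0.22 and removed in 0.24; on
# newer scikit-learn versions the equivalent import would be
# `from sklearn.ensemble import AdaBoostRegressor`.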
-from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.ensemble.weight_boosting import AdaBoostRegressor - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - estimators_: Optional[List[sklearn.tree.DecisionTreeRegressor]] - estimator_weights_: Optional[ndarray] - estimator_errors_: Optional[ndarray] - estimator_params: Optional[tuple] - base_estimator_: Optional[object] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - base_estimator = hyperparams.Constant( - default=None, - description='The base estimator from which the boosted ensemble is built. Support for sample weighting is required.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_estimators = hyperparams.Bounded[int]( - lower=1, - upper=None, - default=50, - description='The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - learning_rate = hyperparams.Uniform( - lower=0.01, - upper=2, - default=0.1, - description='Learning rate shrinks the contribution of each regressor by ``learning_rate``. There is a trade-off between ``learning_rate`` and ``n_estimators``.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - loss = hyperparams.Enumeration[str]( - values=['linear', 'square', 'exponential'], - default='linear', - description='The loss function to use when updating the weights after each boosting iteration.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. 
Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKAdaBoostRegressor(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn AdaBoostRegressor - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.ADABOOST, ], - "name": "sklearn.ensemble.weight_boosting.AdaBoostRegressor", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.ada_boost.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html']}, - "version": "2019.11.13", - "id": "6cab1537-02e1-4dc4-9ebb-53fa2cbabedd", - "hyperparams_to_tune": ['learning_rate', 'n_estimators'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = 
None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = AdaBoostRegressor( - base_estimator=self.hyperparams['base_estimator'], - n_estimators=self.hyperparams['n_estimators'], - learning_rate=self.hyperparams['learning_rate'], - loss=self.hyperparams['loss'], - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - estimators_=None, - estimator_weights_=None, - estimator_errors_=None, - estimator_params=None, - base_estimator_=None, - 
input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - estimators_=getattr(self._clf, 'estimators_', None), - estimator_weights_=getattr(self._clf, 'estimator_weights_', None), - estimator_errors_=getattr(self._clf, 'estimator_errors_', None), - estimator_params=getattr(self._clf, 'estimator_params', None), - base_estimator_=getattr(self._clf, 'base_estimator_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.estimators_ = params['estimators_'] - self._clf.estimator_weights_ = params['estimator_weights_'] - self._clf.estimator_errors_ = params['estimator_errors_'] - self._clf.estimator_params = params['estimator_params'] - self._clf.base_estimator_ = params['base_estimator_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['estimators_'] is not None: - self._fitted = True - if params['estimator_weights_'] is not None: - self._fitted = True - if params['estimator_errors_'] is not None: - self._fitted = True - if params['estimator_params'] is not None: - self._fitted = True - if params['base_estimator_'] is not None: - self._fitted = True - - - - - - def produce_feature_importances(self, *, timeout: float = None, iterations: int = None) -> CallResult[d3m_dataframe]: - output = d3m_dataframe(self._clf.feature_importances_.reshape((1, len(self._input_column_names)))) - output.columns = self._input_column_names - for i in range(len(self._input_column_names)): - output.metadata = output.metadata.update_column(i, {"name": self._input_column_names[i]}) - return CallResult(output) - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - 
cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKAdaBoostRegressor.__doc__ = AdaBoostRegressor.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKBaggingClassifier.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKBaggingClassifier.py deleted file mode 100644 index c875434..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKBaggingClassifier.py +++ /dev/null @@ -1,589 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.ensemble.bagging import BaggingClassifier - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - 
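# A minimal usage sketch for the wrapper defined below (variable names are
# hypothetical; assumes attribute and target DataFrames with d3m metadata
# already attached, e.g. by upstream dataset-to-dataframe and column-parsing
# steps):
#
#     hp = Hyperparams.defaults().replace({'use_semantic_types': True})
#     clf = SKBaggingClassifier(hyperparams=hp, random_seed=0)
#     clf.set_training_data(inputs=train_attributes, outputs=train_targets)
#     clf.fit()
#     predictions = clf.produce(inputs=test_attributes).value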
- -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - base_estimator_: Optional[object] - estimators_: Optional[List[sklearn.tree.DecisionTreeClassifier]] - estimators_features_: Optional[List[ndarray]] - classes_: Optional[ndarray] - n_classes_: Optional[int] - oob_score_: Optional[float] - oob_decision_function_: Optional[List[ndarray]] - n_features_: Optional[int] - _max_features: Optional[int] - _max_samples: Optional[int] - _n_samples: Optional[int] - _seeds: Optional[ndarray] - estimator_params: Optional[tuple] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - n_estimators = hyperparams.Bounded[int]( - default=10, - lower=1, - upper=None, - description='The number of base estimators in the ensemble.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_samples = hyperparams.Union( - configuration=OrderedDict({ - 'absolute': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Bounded[float]( - default=1.0, - lower=0, - upper=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='percent', - description='The number of samples to draw from X to train each base estimator. - If int, then draw `max_samples` samples. - If float, then draw `max_samples * X.shape[0]` samples.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_features = hyperparams.Union( - configuration=OrderedDict({ - 'absolute': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Bounded[float]( - default=1.0, - lower=0, - upper=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='percent', - description='The number of features to draw from X to train each base estimator. - If int, then draw `max_features` features. - If float, then draw `max_features * X.shape[1]` features.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - bootstrap = hyperparams.Enumeration[str]( - values=['bootstrap', 'bootstrap_with_oob_score', 'disabled'], - default='bootstrap', - description='Whether bootstrap samples are used when building trees.' - ' And whether to use out-of-bag samples to estimate the generalization accuracy.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - bootstrap_features = hyperparams.UniformBool( - default=False, - description='Whether features are drawn with replacement.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - warm_start = hyperparams.UniformBool( - default=False, - description='When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new ensemble. .. 
versionadded:: 0.17 *warm_start* constructor parameter.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_jobs = hyperparams.Union( - configuration=OrderedDict({ - 'limit': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'all_cores': hyperparams.Constant( - default=-1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='limit', - description='The number of jobs to run in parallel for both `fit` and `predict`. If -1, then the number of jobs is set to the number of cores.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKBaggingClassifier(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams], - ProbabilisticCompositionalityMixin[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn BaggingClassifier - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.ENSEMBLE_LEARNING, ], - "name": "sklearn.ensemble.bagging.BaggingClassifier", - "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION, - "python_path": "d3m.primitives.classification.bagging.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html']}, - "version": "2019.11.13", - "id": "1b2a32a6-0ec5-3ca0-9386-b8b1f1b831d1", - "hyperparams_to_tune": ['n_estimators', 'max_samples', 'max_features'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = BaggingClassifier( - n_estimators=self.hyperparams['n_estimators'], - max_samples=self.hyperparams['max_samples'], - max_features=self.hyperparams['max_features'], - bootstrap=self.hyperparams['bootstrap'] in ['bootstrap', 'bootstrap_with_oob_score'], - bootstrap_features=self.hyperparams['bootstrap_features'], - oob_score=self.hyperparams['bootstrap'] in ['bootstrap_with_oob_score'], - warm_start=self.hyperparams['warm_start'], - n_jobs=self.hyperparams['n_jobs'], - random_state=self.random_seed, - verbose=_verbose - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = 
self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - base_estimator_=None, - estimators_=None, - estimators_features_=None, - classes_=None, - n_classes_=None, - oob_score_=None, - oob_decision_function_=None, - n_features_=None, - _max_features=None, - _max_samples=None, - _n_samples=None, - _seeds=None, - estimator_params=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - base_estimator_=getattr(self._clf, 'base_estimator_', None), - estimators_=getattr(self._clf, 'estimators_', None), - estimators_features_=getattr(self._clf, 'estimators_features_', None), - classes_=getattr(self._clf, 'classes_', None), - n_classes_=getattr(self._clf, 'n_classes_', None), - oob_score_=getattr(self._clf, 'oob_score_', None), - oob_decision_function_=getattr(self._clf, 'oob_decision_function_', None), - n_features_=getattr(self._clf, 'n_features_', None), - _max_features=getattr(self._clf, '_max_features', None), - _max_samples=getattr(self._clf, '_max_samples', None), - _n_samples=getattr(self._clf, '_n_samples', None), - _seeds=getattr(self._clf, '_seeds', None), - estimator_params=getattr(self._clf, 'estimator_params', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - 
target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.base_estimator_ = params['base_estimator_'] - self._clf.estimators_ = params['estimators_'] - self._clf.estimators_features_ = params['estimators_features_'] - self._clf.classes_ = params['classes_'] - self._clf.n_classes_ = params['n_classes_'] - self._clf.oob_score_ = params['oob_score_'] - self._clf.oob_decision_function_ = params['oob_decision_function_'] - self._clf.n_features_ = params['n_features_'] - self._clf._max_features = params['_max_features'] - self._clf._max_samples = params['_max_samples'] - self._clf._n_samples = params['_n_samples'] - self._clf._seeds = params['_seeds'] - self._clf.estimator_params = params['estimator_params'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['base_estimator_'] is not None: - self._fitted = True - if params['estimators_'] is not None: - self._fitted = True - if params['estimators_features_'] is not None: - self._fitted = True - if params['classes_'] is not None: - self._fitted = True - if params['n_classes_'] is not None: - self._fitted = True - if params['oob_score_'] is not None: - self._fitted = True - if params['oob_decision_function_'] is not None: - self._fitted = True - if params['n_features_'] is not None: - self._fitted = True - if params['_max_features'] is not None: - self._fitted = True - if params['_max_samples'] is not None: - self._fitted = True - if params['_n_samples'] is not None: - self._fitted = True - if params['_seeds'] is not None: - self._fitted = True - if params['estimator_params'] is not None: - self._fitted = True - - - def log_likelihoods(self, *, - outputs: Outputs, - inputs: Inputs, - timeout: float = None, - iterations: int = None) -> CallResult[Sequence[float]]: - inputs = inputs.iloc[:, self._training_indices] # Get ndarray - outputs = outputs.iloc[:, self._target_column_indices] - - if len(inputs.columns) and len(outputs.columns): - - if outputs.shape[1] != self._clf.n_outputs_: - raise exceptions.InvalidArgumentValueError("\"outputs\" argument does not have the correct number of target columns.") - - log_proba = self._clf.predict_log_proba(inputs) - - # Making it always a list, even when only one target. - if self._clf.n_outputs_ == 1: - log_proba = [log_proba] - classes = [self._clf.classes_] - else: - classes = self._clf.classes_ - - samples_length = inputs.shape[0] - - log_likelihoods = [] - for k in range(self._clf.n_outputs_): - # We have to map each class to its internal (numerical) index used in the learner. - # This allows "outputs" to contain string classes. - outputs_column = outputs.iloc[:, k] - classes_map = pandas.Series(numpy.arange(len(classes[k])), index=classes[k]) - mapped_outputs_column = outputs_column.map(classes_map) - - # For each target column (column in "outputs"), for each sample (row) we pick the log - # likelihood for a given class. 
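# Same per-row lookup as in the AdaBoost wrapper above: the indexing below
# picks, for every sample, the log-probability the learner assigns to that
# sample's true class. Returning these per-sample log-likelihoods is what the
# ProbabilisticCompositionalityMixin declared on this class calls for.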
- log_likelihoods.append(log_proba[k][numpy.arange(samples_length), mapped_outputs_column]) - - results = d3m_dataframe(dict(enumerate(log_likelihoods)), generate_metadata=True) - results.columns = outputs.columns - - for k in range(self._clf.n_outputs_): - column_metadata = outputs.metadata.query_column(k) - if 'name' in column_metadata: - results.metadata = results.metadata.update_column(k, {'name': column_metadata['name']}) - - else: - results = d3m_dataframe(generate_metadata=True) - - return CallResult(results) - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, 
hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKBaggingClassifier.__doc__ = BaggingClassifier.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKBaggingRegressor.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKBaggingRegressor.py deleted file mode 100644 index 7a62c7b..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKBaggingRegressor.py +++ /dev/null @@ -1,533 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.ensemble.bagging import BaggingRegressor - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as 
base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - estimators_: Optional[List[sklearn.tree.DecisionTreeRegressor]] - estimators_features_: Optional[List[ndarray]] - oob_score_: Optional[float] - oob_prediction_: Optional[ndarray] - base_estimator_: Optional[object] - n_features_: Optional[int] - _max_features: Optional[int] - _max_samples: Optional[int] - _n_samples: Optional[int] - _seeds: Optional[ndarray] - estimator_params: Optional[tuple] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - base_estimator = hyperparams.Constant( - default=None, - description='The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a decision tree.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_estimators = hyperparams.Bounded[int]( - default=10, - lower=1, - upper=None, - description='The number of base estimators in the ensemble.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_samples = hyperparams.Union( - configuration=OrderedDict({ - 'absolute': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Bounded[float]( - default=1.0, - lower=0, - upper=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='percent', - description='The number of samples to draw from X to train each base estimator. - If int, then draw `max_samples` samples. - If float, then draw `max_samples * X.shape[0]` samples.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_features = hyperparams.Union( - configuration=OrderedDict({ - 'absolute': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Bounded[float]( - default=1.0, - lower=0, - upper=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='percent', - description='The number of features to draw from X to train each base estimator. - If int, then draw `max_features` features. - If float, then draw `max_features * X.shape[1]` features.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - bootstrap = hyperparams.Enumeration[str]( - values=['bootstrap', 'bootstrap_with_oob_score', 'disabled'], - default='bootstrap', - description='Whether bootstrap samples are used when building trees.' 
- ' And whether to use out-of-bag samples to estimate the generalization accuracy.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - bootstrap_features = hyperparams.UniformBool( - default=False, - description='Whether features are drawn with replacement.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - warm_start = hyperparams.UniformBool( - default=False, - description='When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new ensemble. See :term:`the Glossary `.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_jobs = hyperparams.Union( - configuration=OrderedDict({ - 'limit': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'all_cores': hyperparams.Constant( - default=-1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='limit', - description='The number of jobs to run in parallel for both `fit` and `predict`. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary ` for more details.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. 
Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKBaggingRegressor(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn BaggingRegressor - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.ENSEMBLE_LEARNING, ], - "name": "sklearn.ensemble.bagging.BaggingRegressor", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.bagging.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html']}, - "version": "2019.11.13", - "id": "0dbc4b6d-aa57-4f11-ab18-36125880151b", - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = BaggingRegressor( - base_estimator=self.hyperparams['base_estimator'], - n_estimators=self.hyperparams['n_estimators'], - max_samples=self.hyperparams['max_samples'], - max_features=self.hyperparams['max_features'], - bootstrap=self.hyperparams['bootstrap'] in ['bootstrap', 'bootstrap_with_oob_score'], - bootstrap_features=self.hyperparams['bootstrap_features'], - oob_score=self.hyperparams['bootstrap'] in ['bootstrap_with_oob_score'], - warm_start=self.hyperparams['warm_start'], - n_jobs=self.hyperparams['n_jobs'], - random_state=self.random_seed, - verbose=_verbose - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - 
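-        # set_training_data() flips this flag back to True; fit() uses it to skip
-        # retraining when called again without new data having been provided.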
self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - estimators_=None, - estimators_features_=None, - oob_score_=None, - oob_prediction_=None, - base_estimator_=None, - n_features_=None, - _max_features=None, - _max_samples=None, - _n_samples=None, - _seeds=None, - estimator_params=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - estimators_=getattr(self._clf, 'estimators_', None), - estimators_features_=getattr(self._clf, 'estimators_features_', None), - oob_score_=getattr(self._clf, 'oob_score_', None), - oob_prediction_=getattr(self._clf, 'oob_prediction_', None), - base_estimator_=getattr(self._clf, 'base_estimator_', None), - n_features_=getattr(self._clf, 
'n_features_', None), - _max_features=getattr(self._clf, '_max_features', None), - _max_samples=getattr(self._clf, '_max_samples', None), - _n_samples=getattr(self._clf, '_n_samples', None), - _seeds=getattr(self._clf, '_seeds', None), - estimator_params=getattr(self._clf, 'estimator_params', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.estimators_ = params['estimators_'] - self._clf.estimators_features_ = params['estimators_features_'] - self._clf.oob_score_ = params['oob_score_'] - self._clf.oob_prediction_ = params['oob_prediction_'] - self._clf.base_estimator_ = params['base_estimator_'] - self._clf.n_features_ = params['n_features_'] - self._clf._max_features = params['_max_features'] - self._clf._max_samples = params['_max_samples'] - self._clf._n_samples = params['_n_samples'] - self._clf._seeds = params['_seeds'] - self._clf.estimator_params = params['estimator_params'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['estimators_'] is not None: - self._fitted = True - if params['estimators_features_'] is not None: - self._fitted = True - if params['oob_score_'] is not None: - self._fitted = True - if params['oob_prediction_'] is not None: - self._fitted = True - if params['base_estimator_'] is not None: - self._fitted = True - if params['n_features_'] is not None: - self._fitted = True - if params['_max_features'] is not None: - self._fitted = True - if params['_max_samples'] is not None: - self._fitted = True - if params['_n_samples'] is not None: - self._fitted = True - if params['_seeds'] is not None: - self._fitted = True - if params['estimator_params'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") 
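-            # Without semantic type annotations the column can never match the
-            # accepted set checked below, so it is skipped for fitting.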
- return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
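-            # TrueTarget/SuggestedTarget markers are stripped and PredictedTarget
-            # (plus the configured return_semantic_type) is added, so downstream
-            # primitives can tell predictions apart from the original targets.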
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKBaggingRegressor.__doc__ = BaggingRegressor.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKBernoulliNB.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKBernoulliNB.py deleted file mode 100644 index 40dde7e..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKBernoulliNB.py +++ /dev/null @@ -1,508 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.naive_bayes import BernoulliNB - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe 
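-# Both the inputs and outputs of these wrappers are d3m DataFrame containers;
-# these aliases are what the primitive base class generics bind to.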
-Outputs = d3m_dataframe - - -class Params(params.Params): - class_log_prior_: Optional[ndarray] - feature_log_prob_: Optional[ndarray] - class_count_: Optional[ndarray] - feature_count_: Optional[ndarray] - classes_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - alpha = hyperparams.Bounded[float]( - default=1, - lower=0, - upper=None, - description='Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - binarize = hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='float', - description='Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - fit_prior = hyperparams.UniformBool( - default=True, - description='Whether to learn class prior probabilities or not. If false, a uniform prior will be used.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? 
This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKBernoulliNB(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams], - ProbabilisticCompositionalityMixin[Inputs, Outputs, Params, Hyperparams], - ContinueFitMixin[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn BernoulliNB - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.NAIVE_BAYES_CLASSIFIER, ], - "name": "sklearn.naive_bayes.BernoulliNB", - "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION, - "python_path": "d3m.primitives.classification.bernoulli_naive_bayes.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html']}, - "version": "2019.11.13", - "id": "dfb1004e-02ac-3399-ba57-8a95639312cd", - "hyperparams_to_tune": ['alpha', 'binarize', 'fit_prior'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = BernoulliNB( - alpha=self.hyperparams['alpha'], - binarize=self.hyperparams['binarize'], - fit_prior=self.hyperparams['fit_prior'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - 
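-        # Input column names are recorded at fit time and serialized via the
-        # Params class above, so a restored primitive keeps its column bookkeeping.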
self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - def continue_fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._training_inputs is None or self._training_outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.partial_fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - 
inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - class_log_prior_=None, - feature_log_prob_=None, - class_count_=None, - feature_count_=None, - classes_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - class_log_prior_=getattr(self._clf, 'class_log_prior_', None), - feature_log_prob_=getattr(self._clf, 'feature_log_prob_', None), - class_count_=getattr(self._clf, 'class_count_', None), - feature_count_=getattr(self._clf, 'feature_count_', None), - classes_=getattr(self._clf, 'classes_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.class_log_prior_ = params['class_log_prior_'] - self._clf.feature_log_prob_ = params['feature_log_prob_'] - self._clf.class_count_ = params['class_count_'] - self._clf.feature_count_ = params['feature_count_'] - self._clf.classes_ = params['classes_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['class_log_prior_'] is not None: - self._fitted = True - if params['feature_log_prob_'] is not None: - self._fitted = True - if params['class_count_'] is not None: - self._fitted = True - if params['feature_count_'] is not None: - self._fitted = True - if params['classes_'] is not None: - self._fitted = True - - - def log_likelihoods(self, *, - outputs: Outputs, - inputs: Inputs, - timeout: float = None, - iterations: int = None) -> CallResult[Sequence[float]]: - inputs = inputs.iloc[:, self._training_indices] # Get ndarray - outputs = outputs.iloc[:, self._target_column_indices] - - if len(inputs.columns) and len(outputs.columns): - - if outputs.shape[1] != self._clf.n_outputs_: - raise exceptions.InvalidArgumentValueError("\"outputs\" argument does not have the correct number of target columns.") - - log_proba = self._clf.predict_log_proba(inputs) - - # Making it always a list, even when only one target. - if self._clf.n_outputs_ == 1: - log_proba = [log_proba] - classes = [self._clf.classes_] - else: - classes = self._clf.classes_ - - samples_length = inputs.shape[0] - - log_likelihoods = [] - for k in range(self._clf.n_outputs_): - # We have to map each class to its internal (numerical) index used in the learner. - # This allows "outputs" to contain string classes. - outputs_column = outputs.iloc[:, k] - classes_map = pandas.Series(numpy.arange(len(classes[k])), index=classes[k]) - mapped_outputs_column = outputs_column.map(classes_map) - - # For each target column (column in "outputs"), for each sample (row) we pick the log - # likelihood for a given class. 
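-            # With log_proba[k] of shape (n_samples, n_classes), indexing with the
-            # pair (numpy.arange(samples_length), mapped_outputs_column) picks one
-            # entry per row: row i yields log P(y_i | x_i) for the class actually
-            # listed in "outputs".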
- log_likelihoods.append(log_proba[k][numpy.arange(samples_length), mapped_outputs_column]) - - results = d3m_dataframe(dict(enumerate(log_likelihoods)), generate_metadata=True) - results.columns = outputs.columns - - for k in range(self._clf.n_outputs_): - column_metadata = outputs.metadata.query_column(k) - if 'name' in column_metadata: - results.metadata = results.metadata.update_column(k, {'name': column_metadata['name']}) - - else: - results = d3m_dataframe(generate_metadata=True) - - return CallResult(results) - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, 
hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKBernoulliNB.__doc__ = BernoulliNB.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKBinarizer.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKBinarizer.py deleted file mode 100644 index 7d1166e..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKBinarizer.py +++ /dev/null @@ -1,330 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.preprocessing.data import Binarizer - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import 
PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - threshold = hyperparams.Bounded[float]( - default=0.0, - lower=0.0, - upper=None, - description='Feature values below or equal to this are replaced by 0, above it by 1. Threshold may not be less than 0 for operations on sparse matrices.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKBinarizer(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn Binarizer - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.FEATURE_SCALING, ], - "name": "sklearn.preprocessing.data.Binarizer", - "primitive_family": metadata_base.PrimitiveFamily.DATA_PREPROCESSING, - "python_path": "d3m.primitives.data_preprocessing.binarizer.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html']}, - "version": "2019.11.13", - "id": "13777068-9dc0-3c5b-b4da-99350d67ee3f", - "hyperparams_to_tune": ['threshold'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = Binarizer( - threshold=self.hyperparams['threshold'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - 
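-            # transform() may return a scipy sparse matrix; it is densified above
-            # because the d3m DataFrame wrapper used below expects a dense array.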
outputs = self._wrap_predictions(inputs, sk_output) - if len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic 
types and prepare it for predicted targets.
-            semantic_types = set(column_metadata.get('semantic_types', []))
-            semantic_types_to_remove = set([])
-            add_semantic_types = set()
-            add_semantic_types.add(hyperparams["return_semantic_type"])
-            semantic_types = semantic_types - semantic_types_to_remove
-            semantic_types = semantic_types.union(add_semantic_types)
-            column_metadata['semantic_types'] = list(semantic_types)
-
-            target_columns_metadata.append(column_metadata)
-
-        return target_columns_metadata
-
-    @classmethod
-    def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs],
-                                     target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata:
-        outputs_metadata = metadata_base.DataMetadata().generate(value=outputs)
-
-        for column_index, column_metadata in enumerate(target_columns_metadata):
-            column_metadata.pop("structural_type", None)
-            outputs_metadata = outputs_metadata.update_column(column_index, column_metadata)
-
-        return outputs_metadata
-
-    def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs:
-        outputs = d3m_dataframe(predictions, generate_metadata=True)
-        target_columns_metadata = self._copy_inputs_metadata(inputs.metadata, self._training_indices, outputs.metadata, self.hyperparams)
-        outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata)
-        return outputs
-
-
-    @classmethod
-    def _copy_inputs_metadata(cls, inputs_metadata: metadata_base.DataMetadata, input_indices: List[int],
-                              outputs_metadata: metadata_base.DataMetadata, hyperparams):
-        outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length']
-        target_columns_metadata: List[OrderedDict] = []
-        for column_index in input_indices:
-            column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name")
-            if column_name is None:
-                column_name = "output_{}".format(column_index)
-
-            column_metadata = OrderedDict(inputs_metadata.query_column(column_index))
-            semantic_types = set(column_metadata.get('semantic_types', []))
-            semantic_types_to_remove = set([])
-            add_semantic_types = set()
-            add_semantic_types.add(hyperparams["return_semantic_type"])
-            semantic_types = semantic_types - semantic_types_to_remove
-            semantic_types = semantic_types.union(add_semantic_types)
-            column_metadata['semantic_types'] = list(semantic_types)
-
-            column_metadata["name"] = str(column_name)
-            target_columns_metadata.append(column_metadata)
-
-        # If outputs has more columns than index, add Attribute Type to all remaining
-        if outputs_length > len(input_indices):
-            for column_index in range(len(input_indices), outputs_length):
-                column_metadata = OrderedDict()
-                semantic_types = set()
-                semantic_types.add(hyperparams["return_semantic_type"])
-                column_name = "output_{}".format(column_index)
-                column_metadata["semantic_types"] = list(semantic_types)
-                column_metadata["name"] = str(column_name)
-                target_columns_metadata.append(column_metadata)
-
-        return target_columns_metadata
-
-
-SKBinarizer.__doc__ = Binarizer.__doc__
\ No newline at end of file
diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKCountVectorizer.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKCountVectorizer.py
deleted file mode 100644
index 264c92f..0000000
--- a/common-primitives/sklearn-wrap/sklearn_wrap/SKCountVectorizer.py
+++ /dev/null
@@ -1,490 +0,0 @@
-from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple
-from numpy import ndarray
-from collections import
OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.feature_extraction.text import CountVectorizer - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase -from d3m.metadata.base import ALL_ELEMENTS - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - vocabulary_: Optional[Sequence[dict]] - stop_words_: Optional[Sequence[set]] - fixed_vocabulary_: Optional[Sequence[bool]] - _stop_words_id: Optional[Sequence[int]] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - - -class Hyperparams(hyperparams.Hyperparams): - strip_accents = hyperparams.Union( - configuration=OrderedDict({ - 'accents': hyperparams.Enumeration[str]( - default='ascii', - values=['ascii', 'unicode'], - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Remove accents during the preprocessing step. \'ascii\' is a fast method that only works on characters that have an direct ASCII mapping. \'unicode\' is a slightly slower method that works on any characters. None (default) does nothing.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - analyzer = hyperparams.Enumeration[str]( - default='word', - values=['word', 'char', 'char_wb'], - description='Whether the feature should be made of word or character n-grams. Option \'char_wb\' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space. If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - ngram_range = hyperparams.SortedList( - elements=hyperparams.Bounded[int](1, None, 1), - default=(1, 1), - min_size=2, - max_size=2, - description='The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - stop_words = hyperparams.Union( - configuration=OrderedDict({ - 'string': hyperparams.Hyperparameter[str]( - default='english', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'list': hyperparams.List( - elements=hyperparams.Hyperparameter[str](''), - default=[], - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='If \'english\', a built-in stop word list for English is used. If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if ``analyzer == \'word\'``. 
If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - lowercase = hyperparams.UniformBool( - default=True, - description='Convert all characters to lowercase before tokenizing.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - token_pattern = hyperparams.Hyperparameter[str]( - default='(?u)\\b\w\w+\\b', - description='Regular expression denoting what constitutes a "token", only used if ``analyzer == \'word\'``. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_df = hyperparams.Union( - configuration=OrderedDict({ - 'proportion': hyperparams.Bounded[float]( - default=1.0, - lower=0.0, - upper=1.0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'absolute': hyperparams.Bounded[int]( - default=1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='proportion', - description='When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_df = hyperparams.Union( - configuration=OrderedDict({ - 'proportion': hyperparams.Bounded[float]( - default=1.0, - lower=0.0, - upper=1.0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'absolute': hyperparams.Bounded[int]( - default=1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='absolute', - description='When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_features = hyperparams.Union( - configuration=OrderedDict({ - 'absolute': hyperparams.Bounded[int]( - default=1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - binary = hyperparams.UniformBool( - default=False, - description='If True, all non zero counts are set to 1. 
This is useful for discrete probabilistic models that model binary events rather than integer counts.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
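The three ``return_result`` modes defined above are resolved by ``d3m.base.utils.combine_columns``. A rough, hypothetical pandas-only stand-in for its behaviour (the real function also rebuilds metadata and keeps replaced columns at their original positions):

```python
# Hypothetical, simplified stand-in for d3m.base.utils.combine_columns,
# shown only to illustrate the three return_result modes.
import pandas as pd

def combine(inputs: pd.DataFrame, column_indices, columns_list, return_result):
    produced = pd.concat(columns_list, axis=1)
    if return_result == 'new':      # only the newly produced columns
        return produced
    if return_result == 'append':   # original columns, then produced ones
        return pd.concat([inputs, produced], axis=1)
    if return_result == 'replace':  # drop source columns, prepend produced ones
        kept = inputs.drop(columns=inputs.columns[list(column_indices)])
        return pd.concat([produced, kept], axis=1)
    raise ValueError(return_result)
```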
To prevent pipelines from breaking set this to False.", - ) - - - -class SKCountVectorizer(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn CountVectorizer - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.MINIMUM_REDUNDANCY_FEATURE_SELECTION, ], - "name": "sklearn.feature_extraction.text.CountVectorizer", - "primitive_family": metadata_base.PrimitiveFamily.DATA_PREPROCESSING, - "python_path": "d3m.primitives.data_preprocessing.count_vectorizer.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.CountVectorizer.html']}, - "version": "2019.11.13", - "id": "0609859b-8ed9-397f-ac7a-7c4f63863560", - "hyperparams_to_tune": ['max_df', 'min_df'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # True - - self._clf = list() - - self._training_inputs = None - self._target_names = None - self._training_indices = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - - if self._training_inputs is None: - raise ValueError("Missing training data.") - - if len(self._training_indices) > 0: - for column_index in range(len(self._training_inputs.columns)): - clf = self._create_new_sklearn_estimator() - clf.fit(self._training_inputs.iloc[:, column_index]) - self._clf.append(clf) - - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs, training_indices = self._get_columns_to_fit(inputs, self.hyperparams) - else: - training_indices = list(range(len(inputs))) - - # Iterating over all estimators and call transform on them. - # No. 
of estimators should be equal to the number of columns in the input - if len(self._clf) != len(sk_inputs.columns): - raise RuntimeError("Input data does not have the same number of columns as training data") - outputs = [] - if len(self._training_indices) > 0: - for column_index in range(len(sk_inputs.columns)): - clf = self._clf[column_index] - output = clf.transform(sk_inputs.iloc[:, column_index]) - column_name = sk_inputs.columns[column_index] - - if sparse.issparse(output): - output = output.toarray() - output = self._wrap_predictions(inputs, output) - - # Updating column names. - output.columns = map(lambda x: "{}_{}".format(column_name, x), clf.get_feature_names()) - for i, name in enumerate(clf.get_feature_names()): - output.metadata = output.metadata.update((ALL_ELEMENTS, i), {'name': name}) - - outputs.append(output) - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=outputs) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - vocabulary_=None, - stop_words_=None, - fixed_vocabulary_=None, - _stop_words_id=None, - training_indices_=self._training_indices, - target_names_=self._target_names - ) - - return Params( - vocabulary_=list(map(lambda clf: getattr(clf, 'vocabulary_', None), self._clf)), - stop_words_=list(map(lambda clf: getattr(clf, 'stop_words_', None), self._clf)), - fixed_vocabulary_=list(map(lambda clf: getattr(clf, 'fixed_vocabulary_', None), self._clf)), - _stop_words_id=list(map(lambda clf: getattr(clf, '_stop_words_id', None), self._clf)), - training_indices_=self._training_indices, - target_names_=self._target_names - ) - - def set_params(self, *, params: Params) -> None: - for param, val in params.items(): - if val is not None and param not in ['target_names_', 'training_indices_']: - self._clf = list(map(lambda x: self._create_new_sklearn_estimator(), val)) - break - for index in range(len(self._clf)): - for param, val in params.items(): - if val is not None: - setattr(self._clf[index], param, val[index]) - else: - setattr(self._clf[index], param, None) - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._fitted = False - - if params['vocabulary_'] is not None: - self._fitted = True - if params['stop_words_'] is not None: - self._fitted = True - if params['fixed_vocabulary_'] is not None: - self._fitted = True - if params['_stop_words_id'] is not None: - self._fitted = True - - def _create_new_sklearn_estimator(self): - clf = CountVectorizer( - strip_accents=self.hyperparams['strip_accents'], - analyzer=self.hyperparams['analyzer'], - ngram_range=self.hyperparams['ngram_range'], - stop_words=self.hyperparams['stop_words'], - lowercase=self.hyperparams['lowercase'], - token_pattern=self.hyperparams['token_pattern'], - max_df=self.hyperparams['max_df'], - min_df=self.hyperparams['min_df'], - max_features=self.hyperparams['max_features'], - binary=self.hyperparams['binary'], - ) - return clf - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def 
can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (str,) - accepted_semantic_types = set(["http://schema.org/Text",]) - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), [] - target_names = [] - target_semantic_type = [] - target_column_indices = [] - metadata = data.metadata - target_column_indices.extend(metadata.get_columns_with_semantic_type('https://metadata.datadrivendiscovery.org/types/TrueTarget')) - - for column_index in target_column_indices: - if column_index is metadata_base.ALL_ELEMENTS: - continue - column_index = typing.cast(metadata_base.SimpleSelectorSegment, column_index) - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - target_names.append(column_metadata.get('name', str(column_index))) - target_semantic_type.append(column_metadata.get('semantic_types', [])) - - targets = data.iloc[:, target_column_indices] - return targets, target_names, target_semantic_type - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
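This update is plain set arithmetic over semantic-type URIs. A minimal sketch with illustrative values (the classifier and regressor wrappers later in this diff use the same pattern, additionally stripping the TrueTarget/SuggestedTarget markers):

```python
# Sketch of the semantic-type rewrite applied to one output column's metadata.
semantic_types = {
    'https://metadata.datadrivendiscovery.org/types/TrueTarget',  # from training data
    'http://schema.org/Integer',
}
semantic_types_to_remove = {
    'https://metadata.datadrivendiscovery.org/types/TrueTarget',
    'https://metadata.datadrivendiscovery.org/types/SuggestedTarget',
}
add_semantic_types = {
    'https://metadata.datadrivendiscovery.org/types/PredictedTarget',
}

semantic_types = (semantic_types - semantic_types_to_remove) | add_semantic_types
# -> {'http://schema.org/Integer', '.../types/PredictedTarget'}
```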
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._add_target_columns_metadata(outputs.metadata) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/Attribute') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKCountVectorizer.__doc__ = CountVectorizer.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKDecisionTreeClassifier.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKDecisionTreeClassifier.py deleted file mode 100644 index 46d060a..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKDecisionTreeClassifier.py +++ /dev/null @@ -1,621 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.tree.tree import DecisionTreeClassifier -import numpy - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - classes_: 
Optional[Union[ndarray, List[ndarray]]] - max_features_: Optional[int] - n_classes_: Optional[Union[numpy.int64, List[numpy.int64]]] - n_features_: Optional[int] - n_outputs_: Optional[int] - tree_: Optional[object] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - criterion = hyperparams.Enumeration[str]( - values=['gini', 'entropy'], - default='gini', - description='The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - splitter = hyperparams.Enumeration[str]( - values=['best', 'random'], - default='best', - description='The strategy used to choose the split at each node. Supported strategies are "best" to choose the best split and "random" to choose the best random split.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_depth = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - default=10, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_samples_split = hyperparams.Union( - configuration=OrderedDict({ - 'absolute': hyperparams.Bounded[int]( - default=2, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Bounded[float]( - default=0.25, - lower=0, - upper=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='absolute', - description='The minimum number of samples required to split an internal node: - If int, then consider `min_samples_split` as the minimum number. - If float, then `min_samples_split` is a percentage and `ceil(min_samples_split * n_samples)` are the minimum number of samples for each split. .. versionchanged:: 0.18 Added float values for percentages.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_samples_leaf = hyperparams.Union( - configuration=OrderedDict({ - 'absolute': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Bounded[float]( - default=0.25, - lower=0, - upper=0.5, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='absolute', - description='The minimum number of samples required to be at a leaf node: - If int, then consider `min_samples_leaf` as the minimum number. - If float, then `min_samples_leaf` is a percentage and `ceil(min_samples_leaf * n_samples)` are the minimum number of samples for each node. .. 
versionchanged:: 0.18 Added float values for percentages.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_weight_fraction_leaf = hyperparams.Bounded[float]( - default=0, - lower=0, - upper=0.5, - description='The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_leaf_nodes = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Grow a tree with ``max_leaf_nodes`` in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_features = hyperparams.Union( - configuration=OrderedDict({ - 'specified_int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'calculated': hyperparams.Enumeration[str]( - values=['auto', 'sqrt', 'log2'], - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Bounded[float]( - default=0.25, - lower=0, - upper=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='The number of features to consider when looking for the best split: - If int, then consider `max_features` features at each split. - If float, then `max_features` is a percentage and `int(max_features * n_features)` features are considered at each split. - If "auto", then `max_features=sqrt(n_features)`. - If "sqrt", then `max_features=sqrt(n_features)`. - If "log2", then `max_features=log2(n_features)`. - If None, then `max_features=n_features`. Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than ``max_features`` features.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_impurity_decrease = hyperparams.Bounded[float]( - default=0.0, - lower=0.0, - upper=None, - description='A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following:: N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity) where ``N`` is the total number of samples, ``N_t`` is the number of samples at the current node, ``N_t_L`` is the number of samples in the left child, and ``N_t_R`` is the number of samples in the right child. ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum, if ``sample_weight`` is passed. .. 
versionadded:: 0.19 ', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - class_weight = hyperparams.Union( - configuration=OrderedDict({ - 'str': hyperparams.Constant( - default='balanced', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Weights associated with classes in the form ``{class_label: weight}``. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))`` For multi-output, the weights of each column of y will be multiplied. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - presort = hyperparams.UniformBool( - default=False, - description='Whether to presort the data to speed up the finding of best splits in fitting. For the default settings of a decision tree on large datasets, setting this to true may slow down the training process. When using either a smaller dataset or a restricted depth, this may speed up the training.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? 
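The ``n_samples / (n_classes * np.bincount(y))`` formula quoted in the ``class_weight`` description above can be checked directly against scikit-learn on toy labels:

```python
# Verify the 'balanced' class-weight formula on a made-up label vector.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0, 0, 0, 1])              # 3 samples of class 0, 1 of class 1
manual = len(y) / (2 * np.bincount(y))  # n_samples / (n_classes * bincount)
print(manual)                           # [0.6667 2.0] (approximately)
print(compute_class_weight('balanced', classes=np.array([0, 1]), y=y))  # same
```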
This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKDecisionTreeClassifier(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams], - ProbabilisticCompositionalityMixin[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn DecisionTreeClassifier - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.DECISION_TREE, ], - "name": "sklearn.tree.tree.DecisionTreeClassifier", - "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION, - "python_path": "d3m.primitives.classification.decision_tree.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html']}, - "version": "2019.11.13", - "id": "e20d003d-6a9f-35b0-b4b5-20e42b30282a", - "hyperparams_to_tune": ['max_depth', 'min_samples_split', 'min_samples_leaf', 'max_features'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = DecisionTreeClassifier( - criterion=self.hyperparams['criterion'], - splitter=self.hyperparams['splitter'], - max_depth=self.hyperparams['max_depth'], - min_samples_split=self.hyperparams['min_samples_split'], - min_samples_leaf=self.hyperparams['min_samples_leaf'], - min_weight_fraction_leaf=self.hyperparams['min_weight_fraction_leaf'], - max_leaf_nodes=self.hyperparams['max_leaf_nodes'], - 
max_features=self.hyperparams['max_features'], - min_impurity_decrease=self.hyperparams['min_impurity_decrease'], - class_weight=self.hyperparams['class_weight'], - presort=self.hyperparams['presort'], - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - classes_=None, - max_features_=None, - n_classes_=None, - n_features_=None, - n_outputs_=None, - tree_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - 
target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - classes_=getattr(self._clf, 'classes_', None), - max_features_=getattr(self._clf, 'max_features_', None), - n_classes_=getattr(self._clf, 'n_classes_', None), - n_features_=getattr(self._clf, 'n_features_', None), - n_outputs_=getattr(self._clf, 'n_outputs_', None), - tree_=getattr(self._clf, 'tree_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.classes_ = params['classes_'] - self._clf.max_features_ = params['max_features_'] - self._clf.n_classes_ = params['n_classes_'] - self._clf.n_features_ = params['n_features_'] - self._clf.n_outputs_ = params['n_outputs_'] - self._clf.tree_ = params['tree_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['classes_'] is not None: - self._fitted = True - if params['max_features_'] is not None: - self._fitted = True - if params['n_classes_'] is not None: - self._fitted = True - if params['n_features_'] is not None: - self._fitted = True - if params['n_outputs_'] is not None: - self._fitted = True - if params['tree_'] is not None: - self._fitted = True - - - def log_likelihoods(self, *, - outputs: Outputs, - inputs: Inputs, - timeout: float = None, - iterations: int = None) -> CallResult[Sequence[float]]: - inputs = inputs.iloc[:, self._training_indices] # Get ndarray - outputs = outputs.iloc[:, self._target_column_indices] - - if len(inputs.columns) and len(outputs.columns): - - if outputs.shape[1] != self._clf.n_outputs_: - raise exceptions.InvalidArgumentValueError("\"outputs\" argument does not have the correct number of target columns.") - - log_proba = self._clf.predict_log_proba(inputs) - - # Making it always a list, even when only one target. - if self._clf.n_outputs_ == 1: - log_proba = [log_proba] - classes = [self._clf.classes_] - else: - classes = self._clf.classes_ - - samples_length = inputs.shape[0] - - log_likelihoods = [] - for k in range(self._clf.n_outputs_): - # We have to map each class to its internal (numerical) index used in the learner. - # This allows "outputs" to contain string classes. - outputs_column = outputs.iloc[:, k] - classes_map = pandas.Series(numpy.arange(len(classes[k])), index=classes[k]) - mapped_outputs_column = outputs_column.map(classes_map) - - # For each target column (column in "outputs"), for each sample (row) we pick the log - # likelihood for a given class. 
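A minimal sketch of that lookup with made-up labels: ``predict_log_proba`` orders its columns by ``classes_``, so string labels are first mapped to column positions and then gathered row by row:

```python
# Sketch of the label -> column lookup used below: string labels become
# column indices into the per-class log-probability matrix.
import numpy as np
import pandas as pd

classes_k = np.array(["cat", "dog"])        # stand-in for classes[k]
log_proba_k = np.log(np.array([[0.9, 0.1],  # stand-in for log_proba[k]
                               [0.2, 0.8]]))
outputs_column = pd.Series(["cat", "dog"])  # true label of each sample

classes_map = pd.Series(np.arange(len(classes_k)), index=classes_k)
mapped = outputs_column.map(classes_map)    # [0, 1]
picked = log_proba_k[np.arange(2), mapped]  # log(0.9), log(0.8)
print(picked)
```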
- log_likelihoods.append(log_proba[k][numpy.arange(samples_length), mapped_outputs_column]) - - results = d3m_dataframe(dict(enumerate(log_likelihoods)), generate_metadata=True) - results.columns = outputs.columns - - for k in range(self._clf.n_outputs_): - column_metadata = outputs.metadata.query_column(k) - if 'name' in column_metadata: - results.metadata = results.metadata.update_column(k, {'name': column_metadata['name']}) - - else: - results = d3m_dataframe(generate_metadata=True) - - return CallResult(results) - - - - def produce_feature_importances(self, *, timeout: float = None, iterations: int = None) -> CallResult[d3m_dataframe]: - output = d3m_dataframe(self._clf.feature_importances_.reshape((1, len(self._input_column_names)))) - output.columns = self._input_column_names - for i in range(len(self._input_column_names)): - output.metadata = output.metadata.update_column(i, {"name": self._input_column_names[i]}) - return CallResult(output) - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 
'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKDecisionTreeClassifier.__doc__ = DecisionTreeClassifier.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKDecisionTreeRegressor.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKDecisionTreeRegressor.py deleted file mode 100644 index 1886dd3..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKDecisionTreeRegressor.py +++ /dev/null @@ -1,565 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from 
numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.tree.tree import DecisionTreeRegressor - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - max_features_: Optional[int] - n_features_: Optional[int] - n_outputs_: Optional[int] - tree_: Optional[object] - classes_: Optional[Union[ndarray, List[ndarray]]] - n_classes_: Optional[Union[numpy.int64, List[numpy.int64]]] - class_weight: Optional[Union[str, dict, List[dict]]] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - criterion = hyperparams.Enumeration[str]( - values=['mse', 'friedman_mse', 'mae'], - default='mse', - description='The function to measure the quality of a split. Supported criteria are "mse" for the mean squared error, which is equal to variance reduction as feature selection criterion, and "mae" for the mean absolute error. .. versionadded:: 0.18 Mean Absolute Error (MAE) criterion.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - splitter = hyperparams.Enumeration[str]( - values=['best', 'random'], - default='best', - description='The strategy used to choose the split at each node. Supported strategies are "best" to choose the best split and "random" to choose the best random split.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_depth = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=5, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='The maximum depth of the tree. 
If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_samples_split = hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Bounded[float]( - lower=0, - upper=1, - default=1.0, - description='It\'s a percentage and `ceil(min_samples_split * n_samples)` is the minimum number of samples for each split.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=2, - description='Minimum number.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='int', - description='The minimum number of samples required to split an internal node: - If int, then consider `min_samples_split` as the minimum number. - If float, then `min_samples_split` is a percentage and `ceil(min_samples_split * n_samples)` are the minimum number of samples for each split. .. versionchanged:: 0.18 Added float values for percentages.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_samples_leaf = hyperparams.Union( - configuration=OrderedDict({ - 'percent': hyperparams.Bounded[float]( - lower=0, - upper=0.5, - default=0.25, - description='It\'s a percentage and `ceil(min_samples_leaf * n_samples)` is the minimum number of samples for each node.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'absolute': hyperparams.Bounded[int]( - lower=1, - upper=None, - default=1, - description='Minimum number.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='absolute', - description='The minimum number of samples required to be at a leaf node: - If int, then consider `min_samples_leaf` as the minimum number. - If float, then `min_samples_leaf` is a percentage and `ceil(min_samples_leaf * n_samples)` are the minimum number of samples for each node. .. versionchanged:: 0.18 Added float values for percentages.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_weight_fraction_leaf = hyperparams.Bounded[float]( - default=0, - lower=0, - upper=0.5, - description='The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_leaf_nodes = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=10, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Grow a tree with ``max_leaf_nodes`` in best-first fashion. Best nodes are defined as relative reduction in impurity. 
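Whichever branch of each Union is selected, the resolved value is passed straight through to the scikit-learn constructor. A minimal sketch, using the modern ``sklearn.tree`` import path rather than the old ``sklearn.tree.tree`` path this file pins:

```python
# Sketch: 'none' branches resolve to Python None, 'int' branches to plain ints.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

unbounded = DecisionTreeRegressor(max_depth=None, max_leaf_nodes=None)
bounded = DecisionTreeRegressor(max_depth=5, max_leaf_nodes=10)

X = np.arange(32, dtype=float).reshape(-1, 1)
y = X.ravel() ** 2
print(bounded.fit(X, y).get_depth())  # <= 5, as requested
```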
If None then unlimited number of leaf nodes.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_features = hyperparams.Union( - configuration=OrderedDict({ - 'specified_int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'calculated': hyperparams.Enumeration[str]( - values=['auto', 'sqrt', 'log2'], - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Bounded[float]( - default=0.25, - lower=0, - upper=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='calculated', - description='The number of features to consider when looking for the best split: - If int, then consider `max_features` features at each split. - If float, then `max_features` is a percentage and `int(max_features * n_features)` features are considered at each split. - If "auto", then `max_features=n_features`. - If "sqrt", then `max_features=sqrt(n_features)`. - If "log2", then `max_features=log2(n_features)`. - If None, then `max_features=n_features`. Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than ``max_features`` features.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_impurity_decrease = hyperparams.Bounded[float]( - default=0.0, - lower=0.0, - upper=None, - description='A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following:: N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity) where ``N`` is the total number of samples, ``N_t`` is the number of samples at the current node, ``N_t_L`` is the number of samples in the left child, and ``N_t_R`` is the number of samples in the right child. ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum, if ``sample_weight`` is passed. .. versionadded:: 0.19 ', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - presort = hyperparams.UniformBool( - default=False, - description='Whether to presort the data to speed up the finding of best splits in fitting. For the default settings of a decision tree on large datasets, setting this to true may slow down the training process. When using either a smaller dataset or a restricted depth, this may speed up the training.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. 
If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKDecisionTreeRegressor(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn DecisionTreeRegressor - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.DECISION_TREE, ], - "name": "sklearn.tree.tree.DecisionTreeRegressor", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.decision_tree.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html']}, - "version": "2019.11.13", - "id": "6c420bd8-01d1-321f-9a35-afc4b758a5c6", - "hyperparams_to_tune": ['max_depth', 'min_samples_split', 'min_samples_leaf', 'max_features'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = DecisionTreeRegressor( - criterion=self.hyperparams['criterion'], - splitter=self.hyperparams['splitter'], - max_depth=self.hyperparams['max_depth'], - min_samples_split=self.hyperparams['min_samples_split'], - min_samples_leaf=self.hyperparams['min_samples_leaf'], - min_weight_fraction_leaf=self.hyperparams['min_weight_fraction_leaf'], - max_leaf_nodes=self.hyperparams['max_leaf_nodes'], - max_features=self.hyperparams['max_features'], - min_impurity_decrease=self.hyperparams['min_impurity_decrease'], - presort=self.hyperparams['presort'], - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = 
self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - max_features_=None, - n_features_=None, - n_outputs_=None, - tree_=None, - classes_=None, - n_classes_=None, - class_weight=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - max_features_=getattr(self._clf, 'max_features_', None), - n_features_=getattr(self._clf, 'n_features_', None), - n_outputs_=getattr(self._clf, 'n_outputs_', None), - tree_=getattr(self._clf, 'tree_', None), - classes_=getattr(self._clf, 'classes_', None), - n_classes_=getattr(self._clf, 'n_classes_', None), - class_weight=getattr(self._clf, 'class_weight', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.max_features_ = params['max_features_'] - self._clf.n_features_ = params['n_features_'] - self._clf.n_outputs_ = params['n_outputs_'] - self._clf.tree_ = params['tree_'] - self._clf.classes_ = params['classes_'] - self._clf.n_classes_ = params['n_classes_'] - self._clf.class_weight = params['class_weight'] - 
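- # Note: the assignments above push the saved fitted attributes straight
- # back onto the wrapped sklearn estimator, bypassing fit(); the lines
- # below restore the wrapper's own column bookkeeping so produce() can
- # reselect and relabel the same columns.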
self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['max_features_'] is not None: - self._fitted = True - if params['n_features_'] is not None: - self._fitted = True - if params['n_outputs_'] is not None: - self._fitted = True - if params['tree_'] is not None: - self._fitted = True - if params['classes_'] is not None: - self._fitted = True - if params['n_classes_'] is not None: - self._fitted = True - if params['class_weight'] is not None: - self._fitted = True - - - - - - def produce_feature_importances(self, *, timeout: float = None, iterations: int = None) -> CallResult[d3m_dataframe]: - output = d3m_dataframe(self._clf.feature_importances_.reshape((1, len(self._input_column_names)))) - output.columns = self._input_column_names - for i in range(len(self._input_column_names)): - output.metadata = output.metadata.update_column(i, {"name": self._input_column_names[i]}) - return CallResult(output) - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return 
False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKDecisionTreeRegressor.__doc__ = DecisionTreeRegressor.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKDummyClassifier.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKDummyClassifier.py deleted file mode 100644 index 4425428..0000000 --- 
a/common-primitives/sklearn-wrap/sklearn_wrap/SKDummyClassifier.py +++ /dev/null @@ -1,503 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.dummy import DummyClassifier - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - classes_: Optional[ndarray] - n_classes_: Optional[Union[int,ndarray]] - class_prior_: Optional[ndarray] - n_outputs_: Optional[int] - sparse_output_: Optional[bool] - output_2d_: Optional[bool] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - strategy = hyperparams.Choice( - choices={ - 'stratified': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'most_frequent': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'prior': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'uniform': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'constant': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'constant': hyperparams.Union( - configuration=OrderedDict({ - 'str': hyperparams.Hyperparameter[str]( - default='one', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'int': hyperparams.Bounded[int]( - default=1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'ndarray': hyperparams.Hyperparameter[ndarray]( - default=numpy.array([]), - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='int', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ) - }, - default='stratified', - description='Strategy to use to generate predictions. * "stratified": generates predictions by respecting the training set\'s class distribution. * "most_frequent": always predicts the most frequent label in the training set. * "prior": always predicts the class that maximizes the class prior (like "most_frequent") and ``predict_proba`` returns the class prior. * "uniform": generates predictions uniformly at random. * "constant": always predicts a constant label that is provided by the user. This is useful for metrics that evaluate a non-majority class .. 
versionadded:: 0.17 Dummy Classifier now supports prior fitting strategy using parameter *prior*.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKDummyClassifier(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams], - ProbabilisticCompositionalityMixin[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn DummyClassifier - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.RULE_BASED_MACHINE_LEARNING, ], - "name": "sklearn.dummy.DummyClassifier", - "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION, - "python_path": "d3m.primitives.classification.dummy.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html']}, - "version": "2019.11.13", - "id": "a1056ddf-2e89-3d8d-8308-2146170ae54d", - "hyperparams_to_tune": ['strategy'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = DummyClassifier( - strategy=self.hyperparams['strategy']['choice'], - constant=self.hyperparams['strategy'].get('constant', 'int'), - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - 
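- # The selected targets are converted to a raw ndarray below and, when
- # they form a single (n, 1) column, flattened to (n,) so sklearn treats
- # this as a single-output problem.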
sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - classes_=None, - n_classes_=None, - class_prior_=None, - n_outputs_=None, - sparse_output_=None, - output_2d_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - classes_=getattr(self._clf, 'classes_', None), - n_classes_=getattr(self._clf, 'n_classes_', None), - class_prior_=getattr(self._clf, 'class_prior_', None), - n_outputs_=getattr(self._clf, 'n_outputs_', None), - sparse_output_=getattr(self._clf, 'sparse_output_', None), - output_2d_=getattr(self._clf, 'output_2d_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.classes_ = params['classes_'] - self._clf.n_classes_ = params['n_classes_'] - self._clf.class_prior_ = params['class_prior_'] - self._clf.n_outputs_ = params['n_outputs_'] - self._clf.sparse_output_ = params['sparse_output_'] - self._clf.output_2d_ = params['output_2d_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['classes_'] is not None: - self._fitted = True - if params['n_classes_'] is not None: - self._fitted = True - if params['class_prior_'] is not None: - self._fitted = True - if params['n_outputs_'] is not None: - self._fitted = True - if 
params['sparse_output_'] is not None: - self._fitted = True - if params['output_2d_'] is not None: - self._fitted = True - - - def log_likelihoods(self, *, - outputs: Outputs, - inputs: Inputs, - timeout: float = None, - iterations: int = None) -> CallResult[Sequence[float]]: - inputs = inputs.iloc[:, self._training_indices] # Get ndarray - outputs = outputs.iloc[:, self._target_column_indices] - - if len(inputs.columns) and len(outputs.columns): - - if outputs.shape[1] != self._clf.n_outputs_: - raise exceptions.InvalidArgumentValueError("\"outputs\" argument does not have the correct number of target columns.") - - log_proba = self._clf.predict_log_proba(inputs) - - # Making it always a list, even when only one target. - if self._clf.n_outputs_ == 1: - log_proba = [log_proba] - classes = [self._clf.classes_] - else: - classes = self._clf.classes_ - - samples_length = inputs.shape[0] - - log_likelihoods = [] - for k in range(self._clf.n_outputs_): - # We have to map each class to its internal (numerical) index used in the learner. - # This allows "outputs" to contain string classes. - outputs_column = outputs.iloc[:, k] - classes_map = pandas.Series(numpy.arange(len(classes[k])), index=classes[k]) - mapped_outputs_column = outputs_column.map(classes_map) - - # For each target column (column in "outputs"), for each sample (row) we pick the log - # likelihood for a given class. - log_likelihoods.append(log_proba[k][numpy.arange(samples_length), mapped_outputs_column]) - - results = d3m_dataframe(dict(enumerate(log_likelihoods)), generate_metadata=True) - results.columns = outputs.columns - - for k in range(self._clf.n_outputs_): - column_metadata = outputs.metadata.query_column(k) - if 'name' in column_metadata: - results.metadata = results.metadata.update_column(k, {'name': column_metadata['name']}) - - else: - results = d3m_dataframe(generate_metadata=True) - - return CallResult(results) - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: 
d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() 
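- # Each output column gets minimal fresh metadata: a PredictedTarget
- # semantic type plus a name, falling back to "output_{i}" when the
- # source metadata has no name for the column.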
- semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKDummyClassifier.__doc__ = DummyClassifier.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKDummyRegressor.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKDummyRegressor.py deleted file mode 100644 index 020942d..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKDummyRegressor.py +++ /dev/null @@ -1,442 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.dummy import DummyRegressor - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - constant_: Optional[Union[float, ndarray]] - n_outputs_: Optional[int] - output_2d_: Optional[bool] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - strategy = hyperparams.Choice( - choices={ - 'mean': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'median': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'quantile': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'quantile': hyperparams.Uniform( - default=0.5, - lower=0, - upper=1.0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'constant': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'constant': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Bounded[float]( - lower=0, - upper=None, - default=1.0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'int': hyperparams.Bounded[int]( - default=1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'ndarray': hyperparams.Hyperparameter[ndarray]( - default=numpy.array([]), - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='float', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ) - }, - default='mean', - description='Strategy to use to generate predictions. 
* "mean": always predicts the mean of the training set * "median": always predicts the median of the training set * "quantile": always predicts a specified quantile of the training set, provided with the quantile parameter. * "constant": always predicts a constant value that is provided by the user.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKDummyRegressor(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn DummyRegressor - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.RULE_BASED_MACHINE_LEARNING, ], - "name": "sklearn.dummy.DummyRegressor", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.dummy.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html']}, - "version": "2019.11.13", - "id": "05aa5b6a-3b27-34dc-9ba7-8511fb13f253", - "hyperparams_to_tune": ['strategy'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = DummyRegressor( - strategy=self.hyperparams['strategy']['choice'], - quantile=self.hyperparams['strategy'].get('quantile', 0.5), - constant=self.hyperparams['strategy'].get('constant', 'float'), - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - 
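- # A (n, 1) target would otherwise be treated as a two-dimensional
- # (multi-output) array, so it is flattened first, e.g.
- #   numpy.ravel(numpy.array([[1.0], [2.0]]))  ->  array([1., 2.])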
shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - constant_=None, - n_outputs_=None, - output_2d_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - constant_=getattr(self._clf, 'constant_', None), - n_outputs_=getattr(self._clf, 'n_outputs_', None), - output_2d_=getattr(self._clf, 'output_2d_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.constant_ = params['constant_'] - self._clf.n_outputs_ = params['n_outputs_'] - self._clf.output_2d_ = params['output_2d_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['constant_'] is not None: - self._fitted = True - if params['n_outputs_'] is not None: - self._fitted = True - if params['output_2d_'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - 
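- # get_columns_to_use() reconciles the explicit use/exclude index
- # lists with the can_produce_column predicate defined above.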
exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
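- # TrueTarget/SuggestedTarget markers are removed and PredictedTarget
- # (plus the configured return_semantic_type) is added, so downstream
- # primitives treat these columns as predictions rather than ground
- # truth; in shorthand, roughly:
- #   {TrueTarget, Target} -> {Target, PredictedTarget}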
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKDummyRegressor.__doc__ = DummyRegressor.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKElasticNet.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKElasticNet.py deleted file mode 100644 index 894fcad..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKElasticNet.py +++ /dev/null @@ -1,466 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.linear_model.coordinate_descent import ElasticNet - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = 
d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - coef_: Optional[ndarray] - intercept_: Optional[float] - n_iter_: Optional[int] - dual_gap_: Optional[float] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - alpha = hyperparams.Bounded[float]( - default=1.0, - lower=0, - upper=None, - description='Constant that multiplies the penalty terms. Defaults to 1.0. See the notes for the exact mathematical meaning of this parameter.``alpha = 0`` is equivalent to an ordinary least square, solved by the :class:`LinearRegression` object. For numerical reasons, using ``alpha = 0`` with the ``Lasso`` object is not advised. Given this, you should use the :class:`LinearRegression` object.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - l1_ratio = hyperparams.Uniform( - default=0.5, - lower=0, - upper=1, - description='The ElasticNet mixing parameter, with ``0 <= l1_ratio <= 1``. For ``l1_ratio = 0`` the penalty is an L2 penalty. ``For l1_ratio = 1`` it is an L1 penalty. For ``0 < l1_ratio < 1``, the penalty is a combination of L1 and L2.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - fit_intercept = hyperparams.UniformBool( - default=True, - description='Whether the intercept should be estimated or not. If ``False``, the data is assumed to be already centered.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - normalize = hyperparams.UniformBool( - default=False, - description='This parameter is ignored when ``fit_intercept`` is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use :class:`sklearn.preprocessing.StandardScaler` before calling ``fit`` on an estimator with ``normalize=False``.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - precompute = hyperparams.UniformBool( - default=False, - description='Whether to use a precomputed Gram matrix to speed up calculations. The Gram matrix can also be passed as argument. 
For sparse input this option is always ``True`` to preserve sparsity.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - max_iter = hyperparams.Bounded[int]( - default=1000, - lower=0, - upper=None, - description='The maximum number of iterations', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Bounded[float]( - default=0.0001, - lower=0, - upper=None, - description='The tolerance for the optimization: if the updates are smaller than ``tol``, the optimization code checks the dual gap for optimality and continues until it is smaller than ``tol``.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - positive = hyperparams.UniformBool( - default=False, - description='When set to ``True``, forces the coefficients to be positive.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - selection = hyperparams.Enumeration[str]( - default='cyclic', - values=['cyclic', 'random'], - description='If set to \'random\', a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to \'random\') often leads to significantly faster convergence especially when tol is higher than 1e-4.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - warm_start = hyperparams.UniformBool( - default=False, - description='When set to ``True``, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. See :term:`the Glossary `.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? 
This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKElasticNet(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn ElasticNet - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.ELASTIC_NET_REGULARIZATION, ], - "name": "sklearn.linear_model.coordinate_descent.ElasticNet", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.elastic_net.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html']}, - "version": "2019.11.13", - "id": "a85d4ffb-49ab-35b1-a70c-6df209312aae", - "hyperparams_to_tune": ['alpha', 'max_iter', 'l1_ratio'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = ElasticNet( - alpha=self.hyperparams['alpha'], - l1_ratio=self.hyperparams['l1_ratio'], - fit_intercept=self.hyperparams['fit_intercept'], - normalize=self.hyperparams['normalize'], - precompute=self.hyperparams['precompute'], - max_iter=self.hyperparams['max_iter'], - tol=self.hyperparams['tol'], - positive=self.hyperparams['positive'], - selection=self.hyperparams['selection'], - warm_start=self.hyperparams['warm_start'], - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None 
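- # The remaining attributes hold per-instance training state:
- # set_training_data() fills them in and fit() consumes them.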
- self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - coef_=None, - intercept_=None, - n_iter_=None, - dual_gap_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - coef_=getattr(self._clf, 'coef_', None), - intercept_=getattr(self._clf, 'intercept_', None), - n_iter_=getattr(self._clf, 'n_iter_', None), - dual_gap_=getattr(self._clf, 'dual_gap_', None), - input_column_names=self._input_column_names, - 
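One detail in `fit()` above is easy to miss: sklearn regressors expect a 1-D target array for single-output problems, so a single-column target frame is flattened before fitting. Here it is in isolation, with made-up data (the `get_params` call resumes below):

```python
import numpy
import pandas

training_outputs = pandas.DataFrame({'target': [1.0, 2.0, 3.0]})
sk_training_output = training_outputs.values  # ndarray of shape (3, 1)

shape = sk_training_output.shape
if len(shape) == 2 and shape[1] == 1:
    # Flatten the (n, 1) column vector to (n,) before handing it to sklearn.
    sk_training_output = numpy.ravel(sk_training_output)

assert sk_training_output.shape == (3,)
```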
training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.coef_ = params['coef_'] - self._clf.intercept_ = params['intercept_'] - self._clf.n_iter_ = params['n_iter_'] - self._clf.dual_gap_ = params['dual_gap_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['coef_'] is not None: - self._fitted = True - if params['intercept_'] is not None: - self._fitted = True - if params['n_iter_'] is not None: - self._fitted = True - if params['dual_gap_'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 
'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKElasticNet.__doc__ = ElasticNet.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKExtraTreesClassifier.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKExtraTreesClassifier.py deleted file mode 100644 index 51d77c9..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKExtraTreesClassifier.py +++ /dev/null @@ -1,675 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from 
collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.ensemble.forest import ExtraTreesClassifier - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - estimators_: Optional[Sequence[sklearn.base.BaseEstimator]] - classes_: Optional[Union[ndarray, List[ndarray]]] - n_classes_: Optional[Union[int, List[int]]] - n_features_: Optional[int] - n_outputs_: Optional[int] - oob_score_: Optional[float] - oob_decision_function_: Optional[ndarray] - base_estimator_: Optional[object] - estimator_params: Optional[tuple] - base_estimator: Optional[object] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - n_estimators = hyperparams.Bounded[int]( - default=10, - lower=1, - upper=None, - description='The number of trees in the forest.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - criterion = hyperparams.Enumeration[str]( - values=['gini', 'entropy'], - default='gini', - description='The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_depth = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=10, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_samples_split = hyperparams.Union( - configuration=OrderedDict({ - 'absolute': hyperparams.Bounded[int]( - default=2, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Bounded[float]( - default=0.25, - lower=0, - upper=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='absolute', - description='The minimum number of samples required to split an internal node: - If int, then consider `min_samples_split` as the minimum number. - If float, then `min_samples_split` is a percentage and `ceil(min_samples_split * n_samples)` are the minimum number of samples for each split. .. 
versionchanged:: 0.18 Added float values for percentages.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_samples_leaf = hyperparams.Union( - configuration=OrderedDict({ - 'absolute': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Bounded[float]( - default=0.25, - lower=0, - upper=0.5, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='absolute', - description='The minimum number of samples required to be at a leaf node: - If int, then consider `min_samples_leaf` as the minimum number. - If float, then `min_samples_leaf` is a percentage and `ceil(min_samples_leaf * n_samples)` are the minimum number of samples for each node. .. versionchanged:: 0.18 Added float values for percentages.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_weight_fraction_leaf = hyperparams.Bounded[float]( - default=0, - lower=0, - upper=0.5, - description='The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_leaf_nodes = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - default=10, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Grow trees with ``max_leaf_nodes`` in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_features = hyperparams.Union( - configuration=OrderedDict({ - 'specified_int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'calculated': hyperparams.Enumeration[str]( - values=['auto', 'sqrt', 'log2'], - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Bounded[float]( - default=0.25, - lower=0, - upper=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='calculated', - description='The number of features to consider when looking for the best split: - If int, then consider `max_features` features at each split. - If float, then `max_features` is a percentage and `int(max_features * n_features)` features are considered at each split. - If "auto", then `max_features=sqrt(n_features)`. - If "sqrt", then `max_features=sqrt(n_features)`. - If "log2", then `max_features=log2(n_features)`. - If None, then `max_features=n_features`. 
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than ``max_features`` features.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_impurity_decrease = hyperparams.Bounded[float]( - default=0.0, - lower=0.0, - upper=None, - description='A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following:: N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity) where ``N`` is the total number of samples, ``N_t`` is the number of samples at the current node, ``N_t_L`` is the number of samples in the left child, and ``N_t_R`` is the number of samples in the right child. ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum, if ``sample_weight`` is passed. .. versionadded:: 0.19 ', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - bootstrap = hyperparams.Enumeration[str]( - values=['bootstrap', 'bootstrap_with_oob_score', 'disabled'], - default='bootstrap', - description='Whether bootstrap samples are used when building trees.' - ' And whether to use out-of-bag samples to estimate the generalization accuracy.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - n_jobs = hyperparams.Union( - configuration=OrderedDict({ - 'limit': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'all_cores': hyperparams.Constant( - default=-1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='limit', - description='The number of jobs to run in parallel for both `fit` and `predict`. If -1, then the number of jobs is set to the number of cores.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - warm_start = hyperparams.UniformBool( - default=False, - description='When set to ``True``, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - class_weight = hyperparams.Union( - configuration=OrderedDict({ - 'str': hyperparams.Enumeration[str]( - default='balanced', - values=['balanced', 'balanced_subsample'], - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Weights associated with classes in the form ``{class_label: weight}``. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))`` The "balanced_subsample" mode is the same as "balanced" except that weights are computed based on the bootstrap sample for every tree grown. For multi-output, the weights of each column of y will be multiplied. 
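(The `class_weight` description continues below.) `max_depth`, `min_samples_split`, `max_features`, and friends all use the same `hyperparams.Union` idiom: named branches, each a hyper-parameter in its own right, with the default given as a branch name. A minimal sketch of the idiom, assuming only the d3m core package and that `get_default()` resolves to the active branch's own default value:

```python
from collections import OrderedDict
from d3m.metadata import hyperparams

# An int-or-None hyper-parameter, as used for max_depth/max_leaf_nodes above.
max_depth = hyperparams.Union(
    configuration=OrderedDict({
        'int': hyperparams.Bounded[int](lower=0, upper=None, default=10),
        'none': hyperparams.Constant(default=None),
    }),
    default='none',  # the name of the active branch, not a value
)

assert max_depth.get_default() is None  # resolves to that branch's default
```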
Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKExtraTreesClassifier(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams], - ProbabilisticCompositionalityMixin[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn ExtraTreesClassifier - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.DECISION_TREE, ], - "name": "sklearn.ensemble.forest.ExtraTreesClassifier", - "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION, - "python_path": "d3m.primitives.classification.extra_trees.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html']}, - "version": "2019.11.13", - "id": "c8a28f02-ef4a-35a8-87f1-cf79980f5c3e", - "hyperparams_to_tune": ['n_estimators', 'max_depth', 'min_samples_split', 'min_samples_leaf', 'max_features'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # Underlying scikit-learn estimator, configured from the primitive's hyper-parameters. - self._clf = ExtraTreesClassifier( - n_estimators=self.hyperparams['n_estimators'], - criterion=self.hyperparams['criterion'], - max_depth=self.hyperparams['max_depth'], - min_samples_split=self.hyperparams['min_samples_split'], - min_samples_leaf=self.hyperparams['min_samples_leaf'], - min_weight_fraction_leaf=self.hyperparams['min_weight_fraction_leaf'], - max_leaf_nodes=self.hyperparams['max_leaf_nodes'], - max_features=self.hyperparams['max_features'], - min_impurity_decrease=self.hyperparams['min_impurity_decrease'], - bootstrap=self.hyperparams['bootstrap'] in ['bootstrap', 'bootstrap_with_oob_score'], - oob_score=self.hyperparams['bootstrap'] in ['bootstrap_with_oob_score'], - n_jobs=self.hyperparams['n_jobs'], - warm_start=self.hyperparams['warm_start'], - class_weight=self.hyperparams['class_weight'], - random_state=self.random_seed, - verbose=_verbose - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted =
False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - estimators_=None, - classes_=None, - n_classes_=None, - n_features_=None, - n_outputs_=None, - oob_score_=None, - oob_decision_function_=None, - base_estimator_=None, - estimator_params=None, - base_estimator=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - estimators_=getattr(self._clf, 'estimators_', None), - classes_=getattr(self._clf, 'classes_', None), - n_classes_=getattr(self._clf, 'n_classes_', None), - n_features_=getattr(self._clf, 'n_features_', None), - n_outputs_=getattr(self._clf, 'n_outputs_', None), - oob_score_=getattr(self._clf, 'oob_score_', None), - oob_decision_function_=getattr(self._clf, 'oob_decision_function_', None), - base_estimator_=getattr(self._clf, 'base_estimator_', None), - estimator_params=getattr(self._clf, 'estimator_params', None), - 
base_estimator=getattr(self._clf, 'base_estimator', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.estimators_ = params['estimators_'] - self._clf.classes_ = params['classes_'] - self._clf.n_classes_ = params['n_classes_'] - self._clf.n_features_ = params['n_features_'] - self._clf.n_outputs_ = params['n_outputs_'] - self._clf.oob_score_ = params['oob_score_'] - self._clf.oob_decision_function_ = params['oob_decision_function_'] - self._clf.base_estimator_ = params['base_estimator_'] - self._clf.estimator_params = params['estimator_params'] - self._clf.base_estimator = params['base_estimator'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['estimators_'] is not None: - self._fitted = True - if params['classes_'] is not None: - self._fitted = True - if params['n_classes_'] is not None: - self._fitted = True - if params['n_features_'] is not None: - self._fitted = True - if params['n_outputs_'] is not None: - self._fitted = True - if params['oob_score_'] is not None: - self._fitted = True - if params['oob_decision_function_'] is not None: - self._fitted = True - if params['base_estimator_'] is not None: - self._fitted = True - if params['estimator_params'] is not None: - self._fitted = True - if params['base_estimator'] is not None: - self._fitted = True - - - def log_likelihoods(self, *, - outputs: Outputs, - inputs: Inputs, - timeout: float = None, - iterations: int = None) -> CallResult[Sequence[float]]: - inputs = inputs.iloc[:, self._training_indices] # Get ndarray - outputs = outputs.iloc[:, self._target_column_indices] - - if len(inputs.columns) and len(outputs.columns): - - if outputs.shape[1] != self._clf.n_outputs_: - raise exceptions.InvalidArgumentValueError("\"outputs\" argument does not have the correct number of target columns.") - - log_proba = self._clf.predict_log_proba(inputs) - - # Making it always a list, even when only one target. - if self._clf.n_outputs_ == 1: - log_proba = [log_proba] - classes = [self._clf.classes_] - else: - classes = self._clf.classes_ - - samples_length = inputs.shape[0] - - log_likelihoods = [] - for k in range(self._clf.n_outputs_): - # We have to map each class to its internal (numerical) index used in the learner. - # This allows "outputs" to contain string classes. - outputs_column = outputs.iloc[:, k] - classes_map = pandas.Series(numpy.arange(len(classes[k])), index=classes[k]) - mapped_outputs_column = outputs_column.map(classes_map) - - # For each target column (column in "outputs"), for each sample (row) we pick the log - # likelihood for a given class. 
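(The loop body resumes on the next line.) The mapping trick in `log_likelihoods()` deserves a standalone look: string class labels are converted to the learner's internal class indices so each row of the `predict_log_proba` output can be indexed by that row's true class. With made-up data:

```python
import numpy
import pandas

classes = numpy.array(['cat', 'dog', 'fish'])    # stands in for classes_
log_proba = numpy.log(numpy.array([              # stands in for predict_log_proba output
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
]))
outputs_column = pandas.Series(['cat', 'dog'])   # ground-truth labels

# Map each label to its positional index in classes_.
classes_map = pandas.Series(numpy.arange(len(classes)), index=classes)
mapped = outputs_column.map(classes_map)

# For each row, pick the log probability of that row's true class.
log_likelihoods = log_proba[numpy.arange(len(outputs_column)), mapped]
assert numpy.allclose(log_likelihoods, numpy.log([0.7, 0.8]))
```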
- log_likelihoods.append(log_proba[k][numpy.arange(samples_length), mapped_outputs_column]) - - results = d3m_dataframe(dict(enumerate(log_likelihoods)), generate_metadata=True) - results.columns = outputs.columns - - for k in range(self._clf.n_outputs_): - column_metadata = outputs.metadata.query_column(k) - if 'name' in column_metadata: - results.metadata = results.metadata.update_column(k, {'name': column_metadata['name']}) - - else: - results = d3m_dataframe(generate_metadata=True) - - return CallResult(results) - - - - def produce_feature_importances(self, *, timeout: float = None, iterations: int = None) -> CallResult[d3m_dataframe]: - output = d3m_dataframe(self._clf.feature_importances_.reshape((1, len(self._input_column_names)))) - output.columns = self._input_column_names - for i in range(len(self._input_column_names)): - output.metadata = output.metadata.update_column(i, {"name": self._input_column_names[i]}) - return CallResult(output) - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 
'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKExtraTreesClassifier.__doc__ = ExtraTreesClassifier.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKExtraTreesRegressor.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKExtraTreesRegressor.py deleted file mode 100644 index 4e4b10c..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKExtraTreesRegressor.py +++ /dev/null @@ -1,607 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import 
ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.ensemble.forest import ExtraTreesRegressor - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - estimators_: Optional[List[sklearn.tree.ExtraTreeRegressor]] - n_features_: Optional[int] - n_outputs_: Optional[int] - oob_score_: Optional[float] - oob_prediction_: Optional[ndarray] - base_estimator_: Optional[object] - estimator_params: Optional[tuple] - class_weight: Optional[Union[str, dict, List[dict]]] - base_estimator: Optional[object] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - n_estimators = hyperparams.Bounded[int]( - default=10, - lower=1, - upper=None, - description='The number of trees in the forest.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - criterion = hyperparams.Enumeration[str]( - values=['mse', 'mae'], - default='mse', - description='The function to measure the quality of a split. Supported criteria are "mse" for the mean squared error, which is equal to variance reduction as feature selection criterion, and "mae" for the mean absolute error. .. versionadded:: 0.18 Mean Absolute Error (MAE) criterion.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_depth = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=5, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='The maximum depth of the tree. 
If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_samples_split = hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Bounded[float]( - lower=0, - upper=1, - default=1.0, - description='It\'s a percentage and `ceil(min_samples_split * n_samples)` is the minimum number of samples for each split.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=2, - description='Minimum number.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='int', - description='The minimum number of samples required to split an internal node: - If int, then consider `min_samples_split` as the minimum number. - If float, then `min_samples_split` is a percentage and `ceil(min_samples_split * n_samples)` are the minimum number of samples for each split. .. versionchanged:: 0.18 Added float values for percentages.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_samples_leaf = hyperparams.Union( - configuration=OrderedDict({ - 'percent': hyperparams.Bounded[float]( - lower=0, - upper=0.5, - default=0.25, - description='It\'s a percentage and `ceil(min_samples_leaf * n_samples)` is the minimum number of samples for each node.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'absolute': hyperparams.Bounded[int]( - lower=1, - upper=None, - default=1, - description='Minimum number.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='absolute', - description='The minimum number of samples required to be at a leaf node: - If int, then consider `min_samples_leaf` as the minimum number. - If float, then `min_samples_leaf` is a percentage and `ceil(min_samples_leaf * n_samples)` are the minimum number of samples for each node. .. versionchanged:: 0.18 Added float values for percentages.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_weight_fraction_leaf = hyperparams.Bounded[float]( - default=0, - lower=0, - upper=0.5, - description='The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_leaf_nodes = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=10, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Grow trees with ``max_leaf_nodes`` in best-first fashion. Best nodes are defined as relative reduction in impurity. 
If None then unlimited number of leaf nodes.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_features = hyperparams.Union( - configuration=OrderedDict({ - 'specified_int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'calculated': hyperparams.Enumeration[str]( - values=['auto', 'sqrt', 'log2'], - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Bounded[float]( - default=0.25, - lower=0, - upper=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='calculated', - description='The number of features to consider when looking for the best split: - If int, then consider `max_features` features at each split. - If float, then `max_features` is a percentage and `int(max_features * n_features)` features are considered at each split. - If "auto", then `max_features=n_features`. - If "sqrt", then `max_features=sqrt(n_features)`. - If "log2", then `max_features=log2(n_features)`. - If None, then `max_features=n_features`. Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than ``max_features`` features.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_impurity_decrease = hyperparams.Bounded[float]( - default=0.0, - lower=0.0, - upper=None, - description='A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following:: N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity) where ``N`` is the total number of samples, ``N_t`` is the number of samples at the current node, ``N_t_L`` is the number of samples in the left child, and ``N_t_R`` is the number of samples in the right child. ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum, if ``sample_weight`` is passed. .. versionadded:: 0.19 ', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - bootstrap = hyperparams.Enumeration[str]( - values=['bootstrap', 'bootstrap_with_oob_score', 'disabled'], - default='bootstrap', - description='Whether bootstrap samples are used when building trees.' 
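(The `bootstrap` description continues on the next line.) Instead of exposing sklearn's separate `bootstrap` and `oob_score` booleans, these wrappers collapse them into this single enumeration and derive both flags with membership tests, exactly as in the constructors shown earlier. A standalone sketch, with `bootstrap_hp` standing in for `self.hyperparams['bootstrap']`:

```python
# Standalone sketch of the enum-to-flags mapping; not part of the deleted file.
for bootstrap_hp in ['bootstrap', 'bootstrap_with_oob_score', 'disabled']:
    bootstrap = bootstrap_hp in ['bootstrap', 'bootstrap_with_oob_score']
    oob_score = bootstrap_hp in ['bootstrap_with_oob_score']
    print(bootstrap_hp, bootstrap, oob_score)
# bootstrap True False
# bootstrap_with_oob_score True True
# disabled False False
```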
- ' And whether to use out-of-bag samples to estimate the generalization accuracy.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - warm_start = hyperparams.UniformBool( - default=False, - description='When set to ``True``, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_jobs = hyperparams.Union( - configuration=OrderedDict({ - 'limit': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'all_cores': hyperparams.Constant( - default=-1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='limit', - description='The number of jobs to run in parallel for both `fit` and `predict`. If -1, then the number of jobs is set to the number of cores.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. 
Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKExtraTreesRegressor(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn ExtraTreesRegressor - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.DECISION_TREE, ], - "name": "sklearn.ensemble.forest.ExtraTreesRegressor", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.extra_trees.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html']}, - "version": "2019.11.13", - "id": "35321059-2a1a-31fd-9509-5494efc751c7", - "hyperparams_to_tune": ['n_estimators', 'max_depth', 'min_samples_split', 'min_samples_leaf', 'max_features'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = ExtraTreesRegressor( - n_estimators=self.hyperparams['n_estimators'], - criterion=self.hyperparams['criterion'], - max_depth=self.hyperparams['max_depth'], - min_samples_split=self.hyperparams['min_samples_split'], - min_samples_leaf=self.hyperparams['min_samples_leaf'], - min_weight_fraction_leaf=self.hyperparams['min_weight_fraction_leaf'], - max_leaf_nodes=self.hyperparams['max_leaf_nodes'], - max_features=self.hyperparams['max_features'], - min_impurity_decrease=self.hyperparams['min_impurity_decrease'], - bootstrap=self.hyperparams['bootstrap'] in ['bootstrap', 'bootstrap_with_oob_score'], - oob_score=self.hyperparams['bootstrap'] in ['bootstrap_with_oob_score'], - warm_start=self.hyperparams['warm_start'], - n_jobs=self.hyperparams['n_jobs'], - random_state=self.random_seed, - verbose=_verbose - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = 
False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - estimators_=None, - n_features_=None, - n_outputs_=None, - oob_score_=None, - oob_prediction_=None, - base_estimator_=None, - estimator_params=None, - class_weight=None, - base_estimator=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - estimators_=getattr(self._clf, 'estimators_', None), - n_features_=getattr(self._clf, 'n_features_', None), - n_outputs_=getattr(self._clf, 'n_outputs_', None), - oob_score_=getattr(self._clf, 'oob_score_', None), - oob_prediction_=getattr(self._clf, 'oob_prediction_', None), - base_estimator_=getattr(self._clf, 'base_estimator_', None), - estimator_params=getattr(self._clf, 
'estimator_params', None), - class_weight=getattr(self._clf, 'class_weight', None), - base_estimator=getattr(self._clf, 'base_estimator', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.estimators_ = params['estimators_'] - self._clf.n_features_ = params['n_features_'] - self._clf.n_outputs_ = params['n_outputs_'] - self._clf.oob_score_ = params['oob_score_'] - self._clf.oob_prediction_ = params['oob_prediction_'] - self._clf.base_estimator_ = params['base_estimator_'] - self._clf.estimator_params = params['estimator_params'] - self._clf.class_weight = params['class_weight'] - self._clf.base_estimator = params['base_estimator'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['estimators_'] is not None: - self._fitted = True - if params['n_features_'] is not None: - self._fitted = True - if params['n_outputs_'] is not None: - self._fitted = True - if params['oob_score_'] is not None: - self._fitted = True - if params['oob_prediction_'] is not None: - self._fitted = True - if params['base_estimator_'] is not None: - self._fitted = True - if params['estimator_params'] is not None: - self._fitted = True - if params['class_weight'] is not None: - self._fitted = True - if params['base_estimator'] is not None: - self._fitted = True - - - - - - def produce_feature_importances(self, *, timeout: float = None, iterations: int = None) -> CallResult[d3m_dataframe]: - output = d3m_dataframe(self._clf.feature_importances_.reshape((1, len(self._input_column_names)))) - output.columns = self._input_column_names - for i in range(len(self._input_column_names)): - output.metadata = output.metadata.update_column(i, {"name": self._input_column_names[i]}) - return CallResult(output) - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - 
cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKExtraTreesRegressor.__doc__ = ExtraTreesRegressor.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKFastICA.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKFastICA.py deleted file mode 100644 index f160a02..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKFastICA.py +++ /dev/null @@ -1,439 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.decomposition.fastica_ import FastICA - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - n_iter_: Optional[int] - mixing_: Optional[ndarray] - components_: 
Optional[ndarray] - mean_: Optional[ndarray] - whitening_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - n_components = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - description='All components are used.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Number of components to extract. If None no dimension reduction is performed.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - algorithm = hyperparams.Enumeration[str]( - default='parallel', - values=['parallel', 'deflation'], - description='Apply a parallel or deflational FASTICA algorithm.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - whiten = hyperparams.UniformBool( - default=True, - description='If True perform an initial whitening of the data. If False, the data is assumed to have already been preprocessed: it should be centered, normed and white. Otherwise you will get incorrect results. In this case the parameter n_components will be ignored.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - fun = hyperparams.Choice( - choices={ - 'logcosh': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'alpha': hyperparams.Hyperparameter[float]( - default=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'exp': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'cube': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ) - }, - default='logcosh', - description='The functional form of the G function used in the approximation to neg-entropy. Could be either \'logcosh\', \'exp\', or \'cube\'. You can also provide your own function. It should return a tuple containing the value of the function, and of its derivative, in the point. Example: def my_g(x): return x ** 3, 3 * x ** 2', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_iter = hyperparams.Bounded[int]( - default=200, - lower=0, - upper=None, - description='Maximum number of iterations to perform.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Bounded[float]( - default=0.0001, - lower=0, - upper=None, - description='A positive scalar giving the tolerance at which the un-mixing matrix is considered to have converged.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - w_init = hyperparams.Union( - configuration=OrderedDict({ - 'ndarray': hyperparams.Hyperparameter[ndarray]( - default=numpy.array([]), - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Initial un-mixing array of dimension (n.comp,n.comp). 
If None (default) then an array of normal r.v.\'s is used.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKFastICA(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn FastICA - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.PRINCIPAL_COMPONENT_ANALYSIS, ], - "name": "sklearn.decomposition.fastica_.FastICA", - "primitive_family": metadata_base.PrimitiveFamily.DATA_TRANSFORMATION, - "python_path": "d3m.primitives.data_transformation.fast_ica.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.FastICA.html']}, - "version": "2019.11.13", - "id": "03633ffa-425e-37d4-9f1c-bbb552f1e995", - "hyperparams_to_tune": ['n_components', 'algorithm'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = FastICA( - n_components=self.hyperparams['n_components'], - algorithm=self.hyperparams['algorithm'], - whiten=self.hyperparams['whiten'], - fun=self.hyperparams['fun']['choice'], - fun_args=self.hyperparams['fun'], - max_iter=self.hyperparams['max_iter'], - tol=self.hyperparams['tol'], - w_init=self.hyperparams['w_init'], - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise 
PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - outputs = self._wrap_predictions(inputs, sk_output) - if len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - n_iter_=None, - mixing_=None, - components_=None, - mean_=None, - whitening_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - n_iter_=getattr(self._clf, 'n_iter_', None), - mixing_=getattr(self._clf, 'mixing_', None), - components_=getattr(self._clf, 'components_', None), - mean_=getattr(self._clf, 'mean_', None), - whitening_=getattr(self._clf, 'whitening_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.n_iter_ = params['n_iter_'] - self._clf.mixing_ = params['mixing_'] - self._clf.components_ = params['components_'] - self._clf.mean_ = params['mean_'] - self._clf.whitening_ = params['whitening_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['n_iter_'] is not None: - self._fitted = True - if params['mixing_'] is not None: - self._fitted = True - if params['components_'] is not None: - self._fitted = True - if params['mean_'] is not None: - self._fitted = True - if params['whitening_'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - 
column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._copy_inputs_metadata(inputs.metadata, self._training_indices, outputs.metadata, self.hyperparams) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _copy_inputs_metadata(cls, inputs_metadata: metadata_base.DataMetadata, input_indices: List[int], - outputs_metadata: metadata_base.DataMetadata, hyperparams): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - target_columns_metadata: List[OrderedDict] = [] - for column_index in input_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = 
list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - # If outputs has more columns than index, add Attribute Type to all remaining - if outputs_length > len(input_indices): - for column_index in range(len(input_indices), outputs_length): - column_metadata = OrderedDict() - semantic_types = set() - semantic_types.add(hyperparams["return_semantic_type"]) - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = list(semantic_types) - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKFastICA.__doc__ = FastICA.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKFeatureAgglomeration.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKFeatureAgglomeration.py deleted file mode 100644 index 36c1411..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKFeatureAgglomeration.py +++ /dev/null @@ -1,361 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.cluster.hierarchical import FeatureAgglomeration -from numpy import mean as npmean - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - labels_: Optional[ndarray] - n_leaves_: Optional[int] - children_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - n_clusters = hyperparams.Bounded[int]( - default=2, - lower=0, - upper=None, - description='The number of clusters to find.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - affinity = hyperparams.Enumeration[str]( - default='euclidean', - values=['euclidean', 'l1', 'l2', 'manhattan', 'cosine', 'precomputed'], - description='Metric used to compute the linkage. Can be "euclidean", "l1", "l2", "manhattan", "cosine", or \'precomputed\'. If linkage is "ward", only "euclidean" is accepted.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - compute_full_tree = hyperparams.Union( - configuration=OrderedDict({ - 'auto': hyperparams.Constant( - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'bool': hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='auto', - description='Stop early the construction of the tree at n_clusters. This is useful to decrease computation time if the number of clusters is not small compared to the number of features. 
This option is useful only when specifying a connectivity matrix. Note also that when varying the number of clusters and using caching, it may be advantageous to compute the full tree.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - linkage = hyperparams.Enumeration[str]( - default='ward', - values=['ward', 'complete', 'average', 'single'], - description='Which linkage criterion to use. The linkage criterion determines which distance to use between sets of features. The algorithm will merge the pairs of cluster that minimize this criterion. - ward minimizes the variance of the clusters being merged. - average uses the average of the distances of each feature of the two sets. - complete or maximum linkage uses the maximum distances between all features of the two sets.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKFeatureAgglomeration(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn FeatureAgglomeration - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.DATA_STREAM_CLUSTERING, ], - "name": "sklearn.cluster.hierarchical.FeatureAgglomeration", - "primitive_family": metadata_base.PrimitiveFamily.DATA_PREPROCESSING, - "python_path": "d3m.primitives.data_preprocessing.feature_agglomeration.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.cluster.FeatureAgglomeration.html']}, - "version": "2019.11.13", - "id": "f259b009-5e0f-37b1-b117-441aba2b65c8", - "hyperparams_to_tune": ['n_clusters', 'affinity', 'linkage'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = FeatureAgglomeration( - n_clusters=self.hyperparams['n_clusters'], - affinity=self.hyperparams['affinity'], - compute_full_tree=self.hyperparams['compute_full_tree'], - linkage=self.hyperparams['linkage'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if 
self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - outputs = self._wrap_predictions(inputs, sk_output) - if len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - labels_=None, - n_leaves_=None, - children_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - labels_=getattr(self._clf, 'labels_', None), - n_leaves_=getattr(self._clf, 'n_leaves_', None), - children_=getattr(self._clf, 'children_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.labels_ = params['labels_'] - self._clf.n_leaves_ = params['n_leaves_'] - self._clf.children_ = params['children_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['labels_'] is not None: - self._fitted = True - if params['n_leaves_'] is not None: - self._fitted = True - if params['children_'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = 
set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._add_target_columns_metadata(outputs.metadata, self.hyperparams) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams): - - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_name = "output_{}".format(column_index) - column_metadata = OrderedDict() - semantic_types = set() - semantic_types.add(hyperparams["return_semantic_type"]) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKFeatureAgglomeration.__doc__ = FeatureAgglomeration.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKGaussianNB.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKGaussianNB.py deleted file mode 100644 index d132e05..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKGaussianNB.py +++ /dev/null @@ -1,492 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.naive_bayes import GaussianNB - - -from d3m.container.numpy import ndarray as d3m_ndarray 
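# A minimal usage sketch for one of these generated wrappers (hypothetical
# d3m DataFrames `train_df` and `test_df` whose columns carry the
# 'Attribute' and 'TrueTarget' semantic types; `Hyperparams.defaults()`,
# `.replace()` and `CallResult.value` come from the d3m core package):
#
#     hp = Hyperparams.defaults().replace({'use_semantic_types': True})
#     clf = SKGaussianNB(hyperparams=hp, random_seed=0)
#     clf.set_training_data(inputs=train_df, outputs=train_df)
#     clf.fit()
#     predictions = clf.produce(inputs=test_df).value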
-from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - class_prior_: Optional[ndarray] - class_count_: Optional[ndarray] - theta_: Optional[ndarray] - sigma_: Optional[ndarray] - classes_: Optional[ndarray] - epsilon_: Optional[float] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - var_smoothing = hyperparams.Bounded[float]( - lower=0, - upper=None, - default=1e-09, - description='Portion of the largest variance of all features that is added to variances for calculation stability.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. 
Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKGaussianNB(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams], - ProbabilisticCompositionalityMixin[Inputs, Outputs, Params, Hyperparams], - ContinueFitMixin[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn GaussianNB - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.NAIVE_BAYES_CLASSIFIER, ], - "name": "sklearn.naive_bayes.GaussianNB", - "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION, - "python_path": "d3m.primitives.classification.gaussian_naive_bayes.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html']}, - "version": "2019.11.13", - "id": "464783a8-771e-340d-999b-ae90b9f84f0b", - "hyperparams_to_tune": ['var_smoothing'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _priors: Union[ndarray, None] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = GaussianNB( - var_smoothing=self.hyperparams['var_smoothing'], - priors=_priors - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if 
self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - def continue_fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._training_inputs is None or self._training_outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.partial_fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - class_prior_=None, - class_count_=None, - theta_=None, - sigma_=None, - classes_=None, - epsilon_=None, - input_column_names=self._input_column_names, - 
training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - class_prior_=getattr(self._clf, 'class_prior_', None), - class_count_=getattr(self._clf, 'class_count_', None), - theta_=getattr(self._clf, 'theta_', None), - sigma_=getattr(self._clf, 'sigma_', None), - classes_=getattr(self._clf, 'classes_', None), - epsilon_=getattr(self._clf, 'epsilon_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.class_prior_ = params['class_prior_'] - self._clf.class_count_ = params['class_count_'] - self._clf.theta_ = params['theta_'] - self._clf.sigma_ = params['sigma_'] - self._clf.classes_ = params['classes_'] - self._clf.epsilon_ = params['epsilon_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['class_prior_'] is not None: - self._fitted = True - if params['class_count_'] is not None: - self._fitted = True - if params['theta_'] is not None: - self._fitted = True - if params['sigma_'] is not None: - self._fitted = True - if params['classes_'] is not None: - self._fitted = True - if params['epsilon_'] is not None: - self._fitted = True - - - def log_likelihoods(self, *, - outputs: Outputs, - inputs: Inputs, - timeout: float = None, - iterations: int = None) -> CallResult[Sequence[float]]: - inputs = inputs.iloc[:, self._training_indices] # Get ndarray - outputs = outputs.iloc[:, self._target_column_indices] - - if len(inputs.columns) and len(outputs.columns): - - if outputs.shape[1] != self._clf.n_outputs_: - raise exceptions.InvalidArgumentValueError("\"outputs\" argument does not have the correct number of target columns.") - - log_proba = self._clf.predict_log_proba(inputs) - - # Making it always a list, even when only one target. - if self._clf.n_outputs_ == 1: - log_proba = [log_proba] - classes = [self._clf.classes_] - else: - classes = self._clf.classes_ - - samples_length = inputs.shape[0] - - log_likelihoods = [] - for k in range(self._clf.n_outputs_): - # We have to map each class to its internal (numerical) index used in the learner. - # This allows "outputs" to contain string classes. - outputs_column = outputs.iloc[:, k] - classes_map = pandas.Series(numpy.arange(len(classes[k])), index=classes[k]) - mapped_outputs_column = outputs_column.map(classes_map) - - # For each target column (column in "outputs"), for each sample (row) we pick the log - # likelihood for a given class. 
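# Sketch of the advanced indexing used below (hypothetical shapes):
# log_proba[k] has shape (n_samples, n_classes); pairing
# numpy.arange(samples_length) with the row-wise mapped class indices picks
# exactly one entry per row, i.e. the log-likelihood of each sample's
# observed class.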
- log_likelihoods.append(log_proba[k][numpy.arange(samples_length), mapped_outputs_column]) - - results = d3m_dataframe(dict(enumerate(log_likelihoods)), generate_metadata=True) - results.columns = outputs.columns - - for k in range(self._clf.n_outputs_): - column_metadata = outputs.metadata.query_column(k) - if 'name' in column_metadata: - results.metadata = results.metadata.update_column(k, {'name': column_metadata['name']}) - - else: - results = d3m_dataframe(generate_metadata=True) - - return CallResult(results) - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, 
hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKGaussianNB.__doc__ = GaussianNB.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKGaussianProcessRegressor.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKGaussianProcessRegressor.py deleted file mode 100644 index ff8417e..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKGaussianProcessRegressor.py +++ /dev/null @@ -1,463 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.gaussian_process.gpr import GaussianProcessRegressor - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from 
d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - X_train_: Optional[ndarray] - y_train_: Optional[ndarray] - kernel_: Optional[Callable] - alpha_: Optional[ndarray] - log_marginal_likelihood_value_: Optional[float] - _y_train_mean: Optional[ndarray] - _rng: Optional[numpy.random.mtrand.RandomState] - L_: Optional[ndarray] - _K_inv: Optional[object] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - alpha = hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Hyperparameter[float]( - default=1e-10, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'ndarray': hyperparams.Hyperparameter[ndarray]( - default=numpy.array([]), - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='float', - description='Value added to the diagonal of the kernel matrix during fitting. Larger values correspond to increased noise level in the observations and reduce potential numerical issue during fitting. If an array is passed, it must have the same number of entries as the data used for fitting and is used as datapoint-dependent noise level. Note that this is equivalent to adding a WhiteKernel with c=alpha. Allowing to specify the noise level directly as a parameter is mainly for convenience and for consistency with Ridge.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - optimizer = hyperparams.Constant( - default='fmin_l_bfgs_b', - description='Can either be one of the internally supported optimizers for optimizing the kernel\'s parameters, specified by a string, or an externally defined optimizer passed as a callable. If a callable is passed, it must have the signature:: def optimizer(obj_func, initial_theta, bounds): # * \'obj_func\' is the objective function to be maximized, which # takes the hyperparameters theta as parameter and an # optional flag eval_gradient, which determines if the # gradient is returned additionally to the function value # * \'initial_theta\': the initial value for theta, which can be # used by local optimizers # * \'bounds\': the bounds on the values of theta .... # Returned are the best found hyperparameters theta and # the corresponding value of the target function. return theta_opt, func_min Per default, the \'fmin_l_bfgs_b\' algorithm from scipy.optimize is used. If None is passed, the kernel\'s parameters are kept fixed. Available internal optimizers are:: \'fmin_l_bfgs_b\'', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_restarts_optimizer = hyperparams.Bounded[int]( - default=0, - lower=0, - upper=None, - description='The number of restarts of the optimizer for finding the kernel\'s parameters which maximize the log-marginal likelihood. 
The first run of the optimizer is performed from the kernel\'s initial parameters, the remaining ones (if any) from thetas sampled log-uniform randomly from the space of allowed theta-values. If greater than 0, all bounds must be finite. Note that n_restarts_optimizer == 0 implies that one run is performed.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - normalize_y = hyperparams.UniformBool( - default=False, - description='Whether the target values y are normalized, i.e., the mean of the observed target values become zero. This parameter should be set to True if the target values\' mean is expected to differ considerable from zero. When enabled, the normalization effectively modifies the GP\'s prior based on the data, which contradicts the likelihood principle; normalization is thus disabled per default. copy_X_train : bool, optional (default: True) If True, a persistent copy of the training data is stored in the object. Otherwise, just a reference to the training data is stored, which might cause predictions to change if the data is modified externally.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. 
Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKGaussianProcessRegressor(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn GaussianProcessRegressor - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.GAUSSIAN_PROCESS, ], - "name": "sklearn.gaussian_process.gpr.GaussianProcessRegressor", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.gaussian_process.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html']}, - "version": "2019.11.13", - "id": "3894e630-d67b-35d9-ab78-233e264f6324", - "hyperparams_to_tune": ['alpha'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = GaussianProcessRegressor( - alpha=self.hyperparams['alpha'], - optimizer=self.hyperparams['optimizer'], - n_restarts_optimizer=self.hyperparams['n_restarts_optimizer'], - normalize_y=self.hyperparams['normalize_y'], - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - 
self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - X_train_=None, - y_train_=None, - kernel_=None, - alpha_=None, - log_marginal_likelihood_value_=None, - _y_train_mean=None, - _rng=None, - L_=None, - _K_inv=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - X_train_=getattr(self._clf, 'X_train_', None), - y_train_=getattr(self._clf, 'y_train_', None), - kernel_=getattr(self._clf, 'kernel_', None), - alpha_=getattr(self._clf, 'alpha_', None), - log_marginal_likelihood_value_=getattr(self._clf, 'log_marginal_likelihood_value_', None), - _y_train_mean=getattr(self._clf, '_y_train_mean', None), - _rng=getattr(self._clf, '_rng', None), - L_=getattr(self._clf, 'L_', None), - _K_inv=getattr(self._clf, '_K_inv', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.X_train_ = params['X_train_'] - self._clf.y_train_ = params['y_train_'] - self._clf.kernel_ = params['kernel_'] - self._clf.alpha_ = params['alpha_'] - self._clf.log_marginal_likelihood_value_ = params['log_marginal_likelihood_value_'] - 
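# The underscore-prefixed attributes below are sklearn internals, so restoring them via Params is only reliable against the sklearn version this wrapper was generated for. -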
self._clf._y_train_mean = params['_y_train_mean'] - self._clf._rng = params['_rng'] - self._clf.L_ = params['L_'] - self._clf._K_inv = params['_K_inv'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['X_train_'] is not None: - self._fitted = True - if params['y_train_'] is not None: - self._fitted = True - if params['kernel_'] is not None: - self._fitted = True - if params['alpha_'] is not None: - self._fitted = True - if params['log_marginal_likelihood_value_'] is not None: - self._fitted = True - if params['_y_train_mean'] is not None: - self._fitted = True - if params['_rng'] is not None: - self._fitted = True - if params['L_'] is not None: - self._fitted = True - if params['_K_inv'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - 
exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKGaussianProcessRegressor.__doc__ = GaussianProcessRegressor.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKGaussianRandomProjection.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKGaussianRandomProjection.py deleted file mode 100644 index 867d904..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKGaussianRandomProjection.py +++ /dev/null @@ -1,344 +0,0 @@ -from typing import Any, Callable, List, 
Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.random_projection import GaussianRandomProjection - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - n_component_: Optional[int] - components_: Optional[Union[ndarray, sparse.spmatrix]] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - n_components = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=100, - description='Number of components to keep.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'auto': hyperparams.Constant( - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='auto', - description='Dimensionality of the target projection space. n_components can be automatically adjusted according to the number of samples in the dataset and the bound given by the Johnson-Lindenstrauss lemma. In that case the quality of the embedding is controlled by the ``eps`` parameter. It should be noted that Johnson-Lindenstrauss lemma can yield very conservative estimated of the required number of components as it makes no assumption on the structure of the dataset.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - eps = hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=1, - description='Parameter to control the quality of the embedding according to the Johnson-Lindenstrauss lemma when n_components is set to \'auto\'. Smaller values lead to better embedding and higher number of dimensions (n_components) in the target projection space.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. 
Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKGaussianRandomProjection(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn GaussianRandomProjection - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.RANDOM_PROJECTION, ], - "name": "sklearn.random_projection.GaussianRandomProjection", - "primitive_family": metadata_base.PrimitiveFamily.DATA_TRANSFORMATION, - "python_path": "d3m.primitives.data_transformation.gaussian_random_projection.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.random_projection.GaussianRandomProjection.html']}, - "version": "2019.11.13", - "id": "fc933ab9-baaf-47ca-a373-bdd33081f5fa", - "hyperparams_to_tune": ['n_components'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = GaussianRandomProjection( - n_components=self.hyperparams['n_components'], - eps=self.hyperparams['eps'], - random_state=self.random_seed, - ) - - self._inputs = None - 
self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - outputs = self._wrap_predictions(inputs, sk_output) - if len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - n_component_=None, - components_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - n_component_=getattr(self._clf, 'n_component_', None), - components_=getattr(self._clf, 'components_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.n_component_ = params['n_component_'] - self._clf.components_ = params['components_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['n_component_'] is not None: - self._fitted = True - if params['components_'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: 
Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
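- # No semantic types are removed here; the configured "return_semantic_type" (an Attribute type by default) is added to each generated column.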
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._add_target_columns_metadata(outputs.metadata, self.hyperparams) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams): - - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_name = "output_{}".format(column_index) - column_metadata = OrderedDict() - semantic_types = set() - semantic_types.add(hyperparams["return_semantic_type"]) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKGaussianRandomProjection.__doc__ = GaussianRandomProjection.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKGenericUnivariateSelect.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKGenericUnivariateSelect.py deleted file mode 100644 index b0c45ad..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKGenericUnivariateSelect.py +++ /dev/null @@ -1,443 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.feature_selection.univariate_selection import GenericUnivariateSelect -from sklearn.feature_selection import f_classif, f_regression, chi2 - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - scores_:
Optional[ndarray] - pvalues_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - score_func = hyperparams.Enumeration[str]( - default='f_classif', - values=['f_classif', 'f_regression', 'chi2'], - description='Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues). For modes \'percentile\' or \'kbest\' it can return a single array scores.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - mode = hyperparams.Enumeration[str]( - default='percentile', - values=['percentile', 'k_best', 'fpr', 'fdr', 'fwe'], - description='Feature selection mode.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - param = hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Hyperparameter[float]( - default=1e-05, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'int': hyperparams.Hyperparameter[int]( - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='float', - description='Parameter of the corresponding mode.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['update_semantic_types', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", -) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. 
Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKGenericUnivariateSelect(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn GenericUnivariateSelect - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.STATISTICAL_MOMENT_ANALYSIS, ], - "name": "sklearn.feature_selection.univariate_selection.GenericUnivariateSelect", - "primitive_family": metadata_base.PrimitiveFamily.FEATURE_SELECTION, - "python_path": "d3m.primitives.feature_selection.generic_univariate_select.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.GenericUnivariateSelect.html']}, - "version": "2019.11.13", - "id": "1055a114-5c94-33b0-9100-675fd0200e72", - "hyperparams_to_tune": ['mode'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = GenericUnivariateSelect( - score_func=eval(self.hyperparams['score_func']), - mode=self.hyperparams['mode'], - param=self.hyperparams['param'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return 
CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None or self._training_outputs is None: - raise ValueError("Missing training data.") - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.transform(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - target_columns_metadata = self._copy_columns_metadata(inputs.iloc[:, self._training_indices].metadata, - self.produce_support().value) - output = self._wrap_predictions(inputs, sk_output, target_columns_metadata) - output.columns = [inputs.columns[idx] for idx in range(len(inputs.columns)) if idx in self.produce_support().value] - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - if self.hyperparams['return_result'] == 'update_semantic_types': - temp_inputs = inputs.copy() - columns_not_selected = sorted(set(range(len(temp_inputs.columns))) - set(self.produce_support().value)) - - for idx in columns_not_selected: - temp_inputs.metadata = temp_inputs.metadata.remove_semantic_type((metadata_base.ALL_ELEMENTS, idx), - 'https://metadata.datadrivendiscovery.org/types/Attribute') - - temp_inputs = temp_inputs.select_columns(self._training_indices) - outputs = base_utils.combine_columns(return_result='replace', - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=[temp_inputs]) - return CallResult(outputs) - - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output) - - return CallResult(outputs) - - def produce_support(self, *, timeout: float = None, iterations: int = None) -> CallResult[Any]: - all_indices = self._training_indices - selected_indices = self._clf.get_support(indices=True).tolist() - indices = [all_indices[index] for index in selected_indices] - return CallResult(indices) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - scores_=None, - pvalues_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - 
target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - scores_=getattr(self._clf, 'scores_', None), - pvalues_=getattr(self._clf, 'pvalues_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.scores_ = params['scores_'] - self._clf.pvalues_ = params['pvalues_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['scores_'] is not None: - self._fitted = True - if params['pvalues_'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - 
hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - if len(target_columns_metadata) == 1: - name = column_metadata.get("name") - for idx in range(len(outputs.columns)): - outputs_metadata = outputs_metadata.update_column(idx, column_metadata) - if len(outputs.columns) > 1: - # Updating column names.
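- # When a single metadata template is broadcast over several output columns, each column name gets an "_{idx}" suffix to keep names unique.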
- outputs_metadata = outputs_metadata.update((metadata_base.ALL_ELEMENTS, idx), {'name': "{}_{}".format(name, idx)}) - else: - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray, target_columns_metadata) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - - @classmethod - def _copy_columns_metadata(cls, inputs_metadata: metadata_base.DataMetadata, column_indices) -> List[OrderedDict]: - outputs_length = inputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in column_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = [] - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKGenericUnivariateSelect.__doc__ = GenericUnivariateSelect.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKGradientBoostingClassifier.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKGradientBoostingClassifier.py deleted file mode 100644 index 0c92268..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKGradientBoostingClassifier.py +++ /dev/null @@ -1,707 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.ensemble.gradient_boosting import GradientBoostingClassifier -import sys - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - oob_improvement_: Optional[ndarray] - train_score_: Optional[ndarray] - loss_: Optional[object] - init_: Optional[object] - estimators_: Optional[ndarray] - n_features_: Optional[int] - classes_: Optional[ndarray] - max_features_: Optional[int] - n_classes_: Optional[Union[int, List[int]]] - alpha: Optional[float] - _rng: Optional[object] - n_estimators_: Optional[int] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class 
Hyperparams(hyperparams.Hyperparams): - loss = hyperparams.Enumeration[str]( - default='deviance', - values=['deviance', 'exponential'], - description='loss function to be optimized. \'deviance\' refers to deviance (= logistic regression) for classification with probabilistic outputs. For loss \'exponential\' gradient boosting recovers the AdaBoost algorithm.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - learning_rate = hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=None, - description='learning rate shrinks the contribution of each tree by `learning_rate`. There is a trade-off between learning_rate and n_estimators.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_estimators = hyperparams.Bounded[int]( - default=100, - lower=1, - upper=None, - description='The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_depth = hyperparams.Bounded[int]( - default=3, - lower=0, - upper=None, - description='maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - criterion = hyperparams.Enumeration[str]( - default='friedman_mse', - values=['friedman_mse', 'mse', 'mae'], - description='The function to measure the quality of a split. Supported criteria are "friedman_mse" for the mean squared error with improvement score by Friedman, "mse" for mean squared error, and "mae" for the mean absolute error. The default value of "friedman_mse" is generally the best as it can provide a better approximation in some cases. .. versionadded:: 0.18', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_samples_split = hyperparams.Union( - configuration=OrderedDict({ - 'absolute': hyperparams.Bounded[int]( - default=2, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Bounded[float]( - default=0.25, - lower=0, - upper=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='absolute', - description='The minimum number of samples required to split an internal node: - If int, then consider `min_samples_split` as the minimum number. - If float, then `min_samples_split` is a percentage and `ceil(min_samples_split * n_samples)` are the minimum number of samples for each split. .. versionchanged:: 0.18 Added float values for percentages.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_samples_leaf = hyperparams.Union( - configuration=OrderedDict({ - 'absolute': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Bounded[float]( - default=0.25, - lower=0, - upper=0.5, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='absolute', - description='The minimum number of samples required to be at a leaf node: - If int, then consider `min_samples_leaf` as the minimum number. 
- If float, then `min_samples_leaf` is a percentage and `ceil(min_samples_leaf * n_samples)` are the minimum number of samples for each node. .. versionchanged:: 0.18 Added float values for percentages.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_weight_fraction_leaf = hyperparams.Bounded[float]( - default=0, - lower=0, - upper=0.5, - description='The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - subsample = hyperparams.Bounded[float]( - default=1.0, - lower=0, - upper=None, - description='The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. `subsample` interacts with the parameter `n_estimators`. Choosing `subsample < 1.0` leads to a reduction of variance and an increase in bias.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_features = hyperparams.Union( - configuration=OrderedDict({ - 'specified_int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'calculated': hyperparams.Enumeration[str]( - values=['auto', 'sqrt', 'log2'], - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Bounded[float]( - default=0.25, - lower=0, - upper=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='The number of features to consider when looking for the best split: - If int, then consider `max_features` features at each split. - If float, then `max_features` is a percentage and `int(max_features * n_features)` features are considered at each split. - If "auto", then `max_features=sqrt(n_features)`. - If "sqrt", then `max_features=sqrt(n_features)`. - If "log2", then `max_features=log2(n_features)`. - If None, then `max_features=n_features`. Choosing `max_features < n_features` leads to a reduction of variance and an increase in bias. Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than ``max_features`` features.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_leaf_nodes = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - default=10, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Grow trees with ``max_leaf_nodes`` in best-first fashion. Best nodes are defined as relative reduction in impurity. 
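An aside on the `Union` hyper-parameters defined above (a minimal runnable sketch; the `ExampleHyperparams` class is hypothetical and only the d3m core package is assumed): whichever branch is configured, the primitive simply receives that branch's value, so `min_samples_split` arrives as either an `int` or a `float`.

```python
from collections import OrderedDict
from d3m.metadata import hyperparams

class ExampleHyperparams(hyperparams.Hyperparams):
    # Mirrors the min_samples_split Union above: two named branches,
    # each a complete hyper-parameter, with 'absolute' as the default branch.
    min_samples_split = hyperparams.Union(
        configuration=OrderedDict({
            'absolute': hyperparams.Bounded[int](default=2, lower=1, upper=None),
            'percent': hyperparams.Bounded[float](default=0.25, lower=0, upper=1),
        }),
        default='absolute',
    )

defaults = ExampleHyperparams.defaults()
print(defaults['min_samples_split'])            # 2 (the int branch's default)
overridden = defaults.replace({'min_samples_split': 0.5})
print(overridden['min_samples_split'])          # 0.5 (validated against the float branch)
```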
If None then unlimited number of leaf nodes.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_impurity_decrease = hyperparams.Bounded[float]( - default=0.0, - lower=0.0, - upper=None, - description='A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following:: N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity) where ``N`` is the total number of samples, ``N_t`` is the number of samples at the current node, ``N_t_L`` is the number of samples in the left child, and ``N_t_R`` is the number of samples in the right child. ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum, if ``sample_weight`` is passed. .. versionadded:: 0.19', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - warm_start = hyperparams.UniformBool( - default=False, - description='When set to ``True``, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just erase the previous solution.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - presort = hyperparams.Union( - configuration=OrderedDict({ - 'bool': hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'auto': hyperparams.Constant( - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='auto', - description='Whether to presort the data to speed up the finding of best splits in fitting. Auto mode by default will use presorting on dense data and default to normal sorting on sparse data. Setting presort to true on sparse data will raise an error. .. versionadded:: 0.17 *presort* parameter.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - validation_fraction = hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=1, - description='The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if ``n_iter_no_change`` is set to an integer.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_iter_no_change = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - default=5, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='``n_iter_no_change`` is used to decide if early stopping will be used to terminate training when validation score is not improving. By default it is set to None to disable early stopping. If set to a number, it will set aside ``validation_fraction`` size of the training data as validation and terminate training when validation score is not improving in all of the previous ``n_iter_no_change`` numbers of iterations.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Bounded[float]( - default=0.0001, - lower=0, - upper=None, - description='Tolerance for the early stopping. 
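The early-stopping trio defined around this point (`validation_fraction`, `n_iter_no_change`, `tol`) is passed through unchanged to scikit-learn, where the three jointly control when boosting stops. A self-contained sketch against sklearn itself (dataset and numbers are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)
clf = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on boosting stages
    validation_fraction=0.1,   # held-out fraction, used only when n_iter_no_change is set
    n_iter_no_change=5,        # stop after 5 stages without validation improvement
    tol=1e-4,                  # minimum improvement that still counts
    random_state=0,
)
clf.fit(X, y)
print(clf.n_estimators_)       # stages actually fitted, usually well below 500
```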
When the loss is not improving by at least tol for ``n_iter_no_change`` iterations (if set to a number), the training stops.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
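The `use_*`/`exclude_*` column sets above only take effect when `use_semantic_types` is true; the helper methods later in this file hand them to `d3m.base.utils.get_columns_to_use` together with a per-column predicate. A hedged sketch of that contract (the three-column frame and the always-true predicate are made up for illustration):

```python
import pandas
from d3m.base import utils as base_utils
from d3m.container import DataFrame as d3m_dataframe

inputs = d3m_dataframe(pandas.DataFrame({'a': [1.0], 'b': [2.0], 'c': [3.0]}),
                       generate_metadata=True)

def can_use_column(column_index: int) -> bool:
    # The real primitive checks structural and semantic types here;
    # accept everything for the sake of the example.
    return True

columns_to_use, columns_not_to_use = base_utils.get_columns_to_use(
    inputs.metadata,
    use_columns=(),          # empty: consider every column
    exclude_columns=(1,),    # then drop column 1 ('b')
    can_use_column=can_use_column,
)
print(columns_to_use)        # expected: [0, 2]
```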
To prevent pipelines from breaking, set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKGradientBoostingClassifier(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams], - ProbabilisticCompositionalityMixin[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn GradientBoostingClassifier - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.GRADIENT_BOOSTING, ], - "name": "sklearn.ensemble.gradient_boosting.GradientBoostingClassifier", - "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION, - "python_path": "d3m.primitives.classification.gradient_boosting.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html']}, - "version": "2019.11.13", - "id": "01d2c086-91bf-3ca5-b023-5139cf239c77", - "hyperparams_to_tune": ['n_estimators', 'learning_rate', 'max_depth', 'min_samples_leaf', 'min_samples_split', 'max_features'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - self._clf = GradientBoostingClassifier( - loss=self.hyperparams['loss'], - learning_rate=self.hyperparams['learning_rate'], - n_estimators=self.hyperparams['n_estimators'], - max_depth=self.hyperparams['max_depth'], - criterion=self.hyperparams['criterion'], - min_samples_split=self.hyperparams['min_samples_split'], - min_samples_leaf=self.hyperparams['min_samples_leaf'], - min_weight_fraction_leaf=self.hyperparams['min_weight_fraction_leaf'], - subsample=self.hyperparams['subsample'], - max_features=self.hyperparams['max_features'], - max_leaf_nodes=self.hyperparams['max_leaf_nodes'], - min_impurity_decrease=self.hyperparams['min_impurity_decrease'], - warm_start=self.hyperparams['warm_start'], - presort=self.hyperparams['presort'], - validation_fraction=self.hyperparams['validation_fraction'], - n_iter_no_change=self.hyperparams['n_iter_no_change'], - tol=self.hyperparams['tol'], - verbose=_verbose, - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def
set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - oob_improvement_=None, - train_score_=None, - loss_=None, - init_=None, - estimators_=None, - n_features_=None, - classes_=None, - max_features_=None, - n_classes_=None, - alpha=None, - _rng=None, - n_estimators_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - oob_improvement_=getattr(self._clf, 'oob_improvement_', None), - train_score_=getattr(self._clf, 'train_score_', None), - loss_=getattr(self._clf, 'loss_', None), - init_=getattr(self._clf, 'init_', None), - estimators_=getattr(self._clf, 'estimators_', None), - n_features_=getattr(self._clf, 'n_features_', None), - classes_=getattr(self._clf, 'classes_', None), - max_features_=getattr(self._clf, 
'max_features_', None), - n_classes_=getattr(self._clf, 'n_classes_', None), - alpha=getattr(self._clf, 'alpha', None), - _rng=getattr(self._clf, '_rng', None), - n_estimators_=getattr(self._clf, 'n_estimators_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.oob_improvement_ = params['oob_improvement_'] - self._clf.train_score_ = params['train_score_'] - self._clf.loss_ = params['loss_'] - self._clf.init_ = params['init_'] - self._clf.estimators_ = params['estimators_'] - self._clf.n_features_ = params['n_features_'] - self._clf.classes_ = params['classes_'] - self._clf.max_features_ = params['max_features_'] - self._clf.n_classes_ = params['n_classes_'] - self._clf.alpha = params['alpha'] - self._clf._rng = params['_rng'] - self._clf.n_estimators_ = params['n_estimators_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['oob_improvement_'] is not None: - self._fitted = True - if params['train_score_'] is not None: - self._fitted = True - if params['loss_'] is not None: - self._fitted = True - if params['init_'] is not None: - self._fitted = True - if params['estimators_'] is not None: - self._fitted = True - if params['n_features_'] is not None: - self._fitted = True - if params['classes_'] is not None: - self._fitted = True - if params['max_features_'] is not None: - self._fitted = True - if params['n_classes_'] is not None: - self._fitted = True - if params['alpha'] is not None: - self._fitted = True - if params['_rng'] is not None: - self._fitted = True - if params['n_estimators_'] is not None: - self._fitted = True - - - def log_likelihoods(self, *, - outputs: Outputs, - inputs: Inputs, - timeout: float = None, - iterations: int = None) -> CallResult[Sequence[float]]: - inputs = inputs.iloc[:, self._training_indices] # Get ndarray - outputs = outputs.iloc[:, self._target_column_indices] - - if len(inputs.columns) and len(outputs.columns): - - if outputs.shape[1] != self._clf.n_outputs_: - raise exceptions.InvalidArgumentValueError("\"outputs\" argument does not have the correct number of target columns.") - - log_proba = self._clf.predict_log_proba(inputs) - - # Making it always a list, even when only one target. - if self._clf.n_outputs_ == 1: - log_proba = [log_proba] - classes = [self._clf.classes_] - else: - classes = self._clf.classes_ - - samples_length = inputs.shape[0] - - log_likelihoods = [] - for k in range(self._clf.n_outputs_): - # We have to map each class to its internal (numerical) index used in the learner. - # This allows "outputs" to contain string classes. - outputs_column = outputs.iloc[:, k] - classes_map = pandas.Series(numpy.arange(len(classes[k])), index=classes[k]) - mapped_outputs_column = outputs_column.map(classes_map) - - # For each target column (column in "outputs"), for each sample (row) we pick the log - # likelihood for a given class. 
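- # log_proba[k] has shape (n_samples, n_classes); indexing it with
- # (numpy.arange(samples_length), mapped_outputs_column) picks, row by row,
- # the log-probability of that sample's labelled class.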
- log_likelihoods.append(log_proba[k][numpy.arange(samples_length), mapped_outputs_column]) - - results = d3m_dataframe(dict(enumerate(log_likelihoods)), generate_metadata=True) - results.columns = outputs.columns - - for k in range(self._clf.n_outputs_): - column_metadata = outputs.metadata.query_column(k) - if 'name' in column_metadata: - results.metadata = results.metadata.update_column(k, {'name': column_metadata['name']}) - - else: - results = d3m_dataframe(generate_metadata=True) - - return CallResult(results) - - - - def produce_feature_importances(self, *, timeout: float = None, iterations: int = None) -> CallResult[d3m_dataframe]: - output = d3m_dataframe(self._clf.feature_importances_.reshape((1, len(self._input_column_names)))) - output.columns = self._input_column_names - for i in range(len(self._input_column_names)): - output.metadata = output.metadata.update_column(i, {"name": self._input_column_names[i]}) - return CallResult(output) - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 
'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKGradientBoostingClassifier.__doc__ = GradientBoostingClassifier.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKGradientBoostingRegressor.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKGradientBoostingRegressor.py deleted file mode 100644 index 7ec68f0..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKGradientBoostingRegressor.py +++ /dev/null @@ -1,673 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, 
Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.ensemble.gradient_boosting import GradientBoostingRegressor - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - oob_improvement_: Optional[ndarray] - train_score_: Optional[ndarray] - loss_: Optional[object] - estimators_: Optional[object] - n_features_: Optional[int] - init_: Optional[object] - max_features_: Optional[int] - n_classes_: Optional[Union[int, List[int]]] - _rng: Optional[object] - n_estimators_: Optional[int] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - loss = hyperparams.Choice( - choices={ - 'ls': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'lad': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'huber': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'alpha': hyperparams.Constant( - default=0.9, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'quantile': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'alpha': hyperparams.Constant( - default=0.9, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ) - }, - default='ls', - description='loss function to be optimized. \'ls\' refers to least squares regression. \'lad\' (least absolute deviation) is a highly robust loss function solely based on order information of the input variables. \'huber\' is a combination of the two. \'quantile\' allows quantile regression (use `alpha` to specify the quantile).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - learning_rate = hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.1, - description='learning rate shrinks the contribution of each tree by `learning_rate`. There is a trade-off between learning_rate and n_estimators.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_estimators = hyperparams.Bounded[int]( - lower=1, - upper=None, - default=100, - description='The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_depth = hyperparams.Bounded[int]( - lower=0, - upper=None, - default=3, - description='maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. 
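For readers new to d3m's `Choice` hyper-parameter (used for `loss` above): the resolved value behaves like a mapping carrying the selected branch name under 'choice' plus that branch's own nested hyper-parameters, which is exactly how the constructor below consumes it (`self.hyperparams['loss']['choice']`, `.get('alpha', 0.9)`). A minimal sketch with a hypothetical two-branch class:

```python
from collections import OrderedDict
from d3m.metadata import hyperparams

class LossHyperparams(hyperparams.Hyperparams):
    loss = hyperparams.Choice(
        choices={
            'ls': hyperparams.Hyperparams.define(configuration=OrderedDict({})),
            'huber': hyperparams.Hyperparams.define(configuration=OrderedDict({
                'alpha': hyperparams.Constant(default=0.9),
            })),
        },
        default='ls',
    )

hp = LossHyperparams.defaults()
print(hp['loss']['choice'])            # 'ls'
print(hp['loss'].get('alpha', 0.9))    # 0.9 fallback, since the 'ls' branch has no alpha
```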
Tune this parameter for best performance; the best value depends on the interaction of the input variables.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - criterion = hyperparams.Enumeration[str]( - values=['friedman_mse', 'mse', 'mae'], - default='friedman_mse', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - description='The function to measure the quality of a split. Supported criteria are "friedman_mse" for the mean squared error with improvement score by Friedman, "mse" for mean squared error, and "mae" for the mean absolute error. The default value of "friedman_mse" is generally the best as it can provide a better approximation in some cases. .. versionadded:: 0.18' - ) - min_samples_split = hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Bounded[float]( - lower=0, - upper=1, - default=1.0, - description='It\'s a percentage and `ceil(min_samples_split * n_samples)` is the minimum number of samples for each split.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=2, - description='Minimum number.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='int', - description='The minimum number of samples required to split an internal node: - If int, then consider `min_samples_split` as the minimum number. - If float, then `min_samples_split` is a percentage and `ceil(min_samples_split * n_samples)` are the minimum number of samples for each split. .. versionchanged:: 0.18 Added float values for percentages.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_samples_leaf = hyperparams.Union( - configuration=OrderedDict({ - 'percent': hyperparams.Bounded[float]( - lower=0, - upper=0.5, - default=0.25, - description='It\'s a percentage and `ceil(min_samples_leaf * n_samples)` is the minimum number of samples for each node.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'absolute': hyperparams.Bounded[int]( - lower=1, - upper=None, - default=1, - description='Minimum number.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='absolute', - description='The minimum number of samples required to be at a leaf node: - If int, then consider `min_samples_leaf` as the minimum number. - If float, then `min_samples_leaf` is a percentage and `ceil(min_samples_leaf * n_samples)` are the minimum number of samples for each node. .. versionchanged:: 0.18 Added float values for percentages.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_weight_fraction_leaf = hyperparams.Bounded[float]( - default=0, - lower=0, - upper=0.5, - description='The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - subsample = hyperparams.Bounded[float]( - default=1.0, - lower=0, - upper=None, - description='The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. `subsample` interacts with the parameter `n_estimators`.
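Because `subsample` is a row fraction (hence the float type above, mirroring the classifier's definition), values below 1.0 switch the wrapped estimator to stochastic gradient boosting, and sklearn then also tracks out-of-bag improvements. A short illustrative sketch:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, random_state=0)
# Each boosting stage fits on a random 80% of the rows.
model = GradientBoostingRegressor(n_estimators=50, subsample=0.8, random_state=0).fit(X, y)
# oob_improvement_ is only populated when subsample < 1.0 (see Params above).
print(model.oob_improvement_[:3])
```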
Choosing `subsample < 1.0` leads to a reduction of variance and an increase in bias.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_features = hyperparams.Union( - configuration=OrderedDict({ - 'specified_int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'calculated': hyperparams.Enumeration[str]( - values=['auto', 'sqrt', 'log2'], - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Bounded[float]( - default=0.25, - lower=0, - upper=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='The number of features to consider when looking for the best split: - If int, then consider `max_features` features at each split. - If float, then `max_features` is a percentage and `int(max_features * n_features)` features are considered at each split. - If "auto", then `max_features=n_features`. - If "sqrt", then `max_features=sqrt(n_features)`. - If "log2", then `max_features=log2(n_features)`. - If None, then `max_features=n_features`. Choosing `max_features < n_features` leads to a reduction of variance and an increase in bias. Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than ``max_features`` features.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_leaf_nodes = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=10, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Grow trees with ``max_leaf_nodes`` in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_impurity_decrease = hyperparams.Bounded[float]( - default=0.0, - lower=0.0, - upper=None, - description='A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following:: N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity) where ``N`` is the total number of samples, ``N_t`` is the number of samples at the current node, ``N_t_L`` is the number of samples in the left child, and ``N_t_R`` is the number of samples in the right child. ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum, if ``sample_weight`` is passed. .. 
versionadded:: 0.19 ', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - warm_start = hyperparams.UniformBool( - default=False, - description='When set to ``True``, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just erase the previous solution.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - presort = hyperparams.Union( - configuration=OrderedDict({ - 'bool': hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'auto': hyperparams.Constant( - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='auto', - description='Whether to presort the data to speed up the finding of best splits in fitting. Auto mode by default will use presorting on dense data and default to normal sorting on sparse data. Setting presort to true on sparse data will raise an error. .. versionadded:: 0.17 optional parameter *presort*.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - validation_fraction = hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=1, - description='The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if ``n_iter_no_change`` is set to an integer.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_iter_no_change = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - default=5, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='``n_iter_no_change`` is used to decide if early stopping will be used to terminate training when validation score is not improving. By default it is set to None to disable early stopping. If set to a number, it will set aside ``validation_fraction`` size of the training data as validation and terminate training when validation score is not improving in all of the previous ``n_iter_no_change`` numbers of iterations.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Bounded[float]( - default=0.0001, - lower=0, - upper=None, - description='Tolerance for the early stopping. When the loss is not improving by at least tol for ``n_iter_no_change`` iterations (if set to a number), the training stops.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. 
If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
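`return_result` and `add_index_columns` are honored downstream by `d3m.base.utils.combine_columns`, the call `produce` ends with in this file. A hedged sketch (the tiny frames are made up, and whether index columns carry over in the 'new' case depends on the metadata actually marking a primary key):

```python
import pandas
from d3m.base import utils as base_utils
from d3m.container import DataFrame as d3m_dataframe

inputs = d3m_dataframe(pandas.DataFrame({'d3mIndex': [0, 1], 'feature': [3.5, 4.2]}),
                       generate_metadata=True)
predictions = d3m_dataframe(pandas.DataFrame({'target': ['a', 'b']}),
                            generate_metadata=True)

outputs = base_utils.combine_columns(
    return_result='append',   # or 'replace' / 'new', as the hyper-parameter describes
    add_index_columns=True,   # only consulted when return_result is 'new'
    inputs=inputs,
    column_indices=[],        # indices being replaced; unused for 'append'
    columns_list=[predictions],
)
print(list(outputs.columns))  # expected: ['d3mIndex', 'feature', 'target']
```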
To prevent pipelines from breaking, set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKGradientBoostingRegressor(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn GradientBoostingRegressor - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.GRADIENT_BOOSTING, ], - "name": "sklearn.ensemble.gradient_boosting.GradientBoostingRegressor", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.gradient_boosting.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html']}, - "version": "2019.11.13", - "id": "2a031907-6b2c-3390-b365-921f89c8816a", - "hyperparams_to_tune": ['n_estimators', 'learning_rate', 'max_depth', 'min_samples_leaf', 'min_samples_split', 'max_features'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - self._clf = GradientBoostingRegressor( - loss=self.hyperparams['loss']['choice'], - alpha=self.hyperparams['loss'].get('alpha', 0.9), - learning_rate=self.hyperparams['learning_rate'], - n_estimators=self.hyperparams['n_estimators'], - max_depth=self.hyperparams['max_depth'], - criterion=self.hyperparams['criterion'], - min_samples_split=self.hyperparams['min_samples_split'], - min_samples_leaf=self.hyperparams['min_samples_leaf'], - min_weight_fraction_leaf=self.hyperparams['min_weight_fraction_leaf'], - subsample=self.hyperparams['subsample'], - max_features=self.hyperparams['max_features'], - max_leaf_nodes=self.hyperparams['max_leaf_nodes'], - min_impurity_decrease=self.hyperparams['min_impurity_decrease'], - warm_start=self.hyperparams['warm_start'], - presort=self.hyperparams['presort'], - validation_fraction=self.hyperparams['validation_fraction'], - n_iter_no_change=self.hyperparams['n_iter_no_change'], - tol=self.hyperparams['tol'], - verbose=_verbose, - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *,
inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - oob_improvement_=None, - train_score_=None, - loss_=None, - estimators_=None, - n_features_=None, - init_=None, - max_features_=None, - n_classes_=None, - _rng=None, - n_estimators_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - oob_improvement_=getattr(self._clf, 'oob_improvement_', None), - train_score_=getattr(self._clf, 'train_score_', None), - loss_=getattr(self._clf, 'loss_', None), - estimators_=getattr(self._clf, 'estimators_', None), - n_features_=getattr(self._clf, 'n_features_', None), - init_=getattr(self._clf, 'init_', None), - max_features_=getattr(self._clf, 'max_features_', None), - n_classes_=getattr(self._clf, 'n_classes_', None), - _rng=getattr(self._clf, '_rng', 
None), - n_estimators_=getattr(self._clf, 'n_estimators_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.oob_improvement_ = params['oob_improvement_'] - self._clf.train_score_ = params['train_score_'] - self._clf.loss_ = params['loss_'] - self._clf.estimators_ = params['estimators_'] - self._clf.n_features_ = params['n_features_'] - self._clf.init_ = params['init_'] - self._clf.max_features_ = params['max_features_'] - self._clf.n_classes_ = params['n_classes_'] - self._clf._rng = params['_rng'] - self._clf.n_estimators_ = params['n_estimators_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['oob_improvement_'] is not None: - self._fitted = True - if params['train_score_'] is not None: - self._fitted = True - if params['loss_'] is not None: - self._fitted = True - if params['estimators_'] is not None: - self._fitted = True - if params['n_features_'] is not None: - self._fitted = True - if params['init_'] is not None: - self._fitted = True - if params['max_features_'] is not None: - self._fitted = True - if params['n_classes_'] is not None: - self._fitted = True - if params['_rng'] is not None: - self._fitted = True - if params['n_estimators_'] is not None: - self._fitted = True - - - - - - def produce_feature_importances(self, *, timeout: float = None, iterations: int = None) -> CallResult[d3m_dataframe]: - output = d3m_dataframe(self._clf.feature_importances_.reshape((1, len(self._input_column_names)))) - output.columns = self._input_column_names - for i in range(len(self._input_column_names)): - output.metadata = output.metadata.update_column(i, {"name": self._input_column_names[i]}) - return CallResult(output) - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column 
metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKGradientBoostingRegressor.__doc__ = GradientBoostingRegressor.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKImputer.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKImputer.py deleted file mode 100644 index 203a3ca..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKImputer.py +++ /dev/null @@ -1,391 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.impute import SimpleImputer -from sklearn.impute._base import _get_mask - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - statistics_: Optional[ndarray] - 
indicator_: Optional[sklearn.base.BaseEstimator] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - missing_values = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Hyperparameter[int]( - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'float': hyperparams.Hyperparameter[float]( - default=numpy.nan, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='float', - description='The placeholder for the missing values. All occurrences of `missing_values` will be imputed.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - strategy = hyperparams.Enumeration[str]( - default='mean', - values=['median', 'most_frequent', 'mean', 'constant'], - description='The imputation strategy. - If "mean", then replace missing values using the mean along each column. Can only be used with numeric data. - If "median", then replace missing values using the median along each column. Can only be used with numeric data. - If "most_frequent", then replace missing using the most frequent value along each column. Can be used with strings or numeric data. - If "constant", then replace missing values with fill_value. Can be used with strings or numeric data. .. versionadded:: 0.20 strategy="constant" for fixed value imputation.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - add_indicator = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - fill_value = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Hyperparameter[int]( - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='When strategy == "constant", fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and "missing_value" for strings or object data types.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? 
This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKImputer(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn SimpleImputer - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.IMPUTATION, ], - "name": "sklearn.impute.SimpleImputer", - "primitive_family": metadata_base.PrimitiveFamily.DATA_CLEANING, - "python_path": "d3m.primitives.data_cleaning.imputer.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html']}, - "version": "2019.11.13", - "id": "d016df89-de62-3c53-87ed-c06bb6a23cde", - "hyperparams_to_tune": ['strategy'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = SimpleImputer( - missing_values=self.hyperparams['missing_values'], - strategy=self.hyperparams['strategy'], - add_indicator=self.hyperparams['add_indicator'], - fill_value=self.hyperparams['fill_value'], - verbose=_verbose - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - 
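# Clearing the fitted flag below forces fit() to re-run SimpleImputer on the
# newly supplied training data instead of reusing stale statistics.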
self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices, _ = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use, _ = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.transform(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - target_columns_metadata = self._copy_columns_metadata(inputs.metadata, self._training_indices, self.hyperparams) - output = self._wrap_predictions(inputs, sk_output, target_columns_metadata) - - output.columns = [inputs.columns[idx] for idx in range(len(inputs.columns)) if idx in self._training_indices] - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - _, _, dropped_cols = self._get_columns_to_fit(inputs, self.hyperparams) - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices + dropped_cols, - columns_list=output) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - statistics_=None, - indicator_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - statistics_=getattr(self._clf, 'statistics_', None), - indicator_=getattr(self._clf, 'indicator_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.statistics_ = params['statistics_'] - self._clf.indicator_ = params['indicator_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['statistics_'] is not None: - self._fitted = True - if params['indicator_'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - - if not hyperparams['use_semantic_types']: - columns_to_produce = list(range(len(inputs.columns))) - - else: - inputs_metadata = inputs.metadata - - def 
can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - - columns_to_drop = cls._get_columns_to_drop(inputs, columns_to_produce, hyperparams) - for col in columns_to_drop: - columns_to_produce.remove(col) - - return inputs.iloc[:, columns_to_produce], columns_to_produce, columns_to_drop - - @classmethod - def _get_columns_to_drop(cls, inputs: Inputs, column_indices: List[int], hyperparams: Hyperparams): - """ - Check for columns that contain missing_values that need to be imputed. - If strategy is "constant" and missing_values is NaN, then all-NaN columns will not be dropped. - :param inputs: - :param column_indices: - :return: - """ - columns_to_remove = [] - if hyperparams['strategy'] != "constant": - for col in column_indices: - inp = inputs.iloc[:, [col]].values - mask = _get_mask(inp, hyperparams['missing_values']) - if mask.all(): - columns_to_remove.append(col) - return columns_to_remove - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - -
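The interplay of `_get_columns_to_drop` and the `strategy` hyper-parameter above is easy to miss: unless `strategy` is `"constant"`, a column whose values are *all* missing is dropped before fitting, because mean/median/most-frequent statistics are undefined on an empty column. A minimal standalone sketch of that rule, assuming `missing_values` is NaN and using `numpy.isnan` in place of sklearn's private `_get_mask` helper (`columns_to_drop` is a hypothetical stand-in, not the wrapper's method):

```python
import numpy as np

# Hypothetical stand-in for _get_columns_to_drop: report columns that are
# entirely missing, which only a "constant" fill strategy can impute.
def columns_to_drop(X: np.ndarray, strategy: str) -> list:
    if strategy == "constant":
        return []  # constant fill handles all-missing columns, so keep them
    return [j for j in range(X.shape[1]) if np.isnan(X[:, j]).all()]

X = np.array([[1.0, np.nan],
              [2.0, np.nan]])
assert columns_to_drop(X, "mean") == [1]      # all-NaN column is dropped
assert columns_to_drop(X, "constant") == []   # kept when filling a constant
```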
- @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray, target_columns_metadata) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - - @classmethod - def _copy_columns_metadata(cls, inputs_metadata: metadata_base.DataMetadata, column_indices, hyperparams) -> List[OrderedDict]: - outputs_length = inputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in column_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKImputer.__doc__ = SimpleImputer.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKKNeighborsClassifier.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKKNeighborsClassifier.py deleted file mode 100644 index 75d5f2f..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKKNeighborsClassifier.py +++ /dev/null @@ -1,497 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.neighbors.classification import KNeighborsClassifier - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin
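Every wrapper in this family repeats the same persistence idiom that `SKImputer` uses above and `SKKNeighborsClassifier` uses below: `get_params` reads each learned sklearn attribute with `getattr(..., None)` so an unfitted estimator still serializes cleanly, and `set_params` marks the primitive fitted if any restored attribute is non-None. A compact sketch of that idiom, with simplified free functions and an illustrative subset of attribute names (not the wrappers' actual methods):

```python
from sklearn.neighbors import KNeighborsClassifier

LEARNED = ("classes_", "_y", "effective_metric_")  # illustrative subset

def export_params(clf) -> dict:
    # getattr(..., None) keeps this safe on an unfitted estimator
    return {name: getattr(clf, name, None) for name in LEARNED}

def restore_params(clf, saved: dict) -> bool:
    fitted = False
    for name, value in saved.items():
        setattr(clf, name, value)
        fitted = fitted or value is not None  # any learned attribute implies fitted
    return fitted

fresh = KNeighborsClassifier()
assert restore_params(KNeighborsClassifier(), export_params(fresh)) is False
```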
-from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - _fit_method: Optional[str] - _fit_X: Optional[ndarray] - _tree: Optional[object] - classes_: Optional[ndarray] - _y: Optional[ndarray] - outputs_2d_: Optional[bool] - effective_metric_: Optional[str] - effective_metric_params_: Optional[Dict] - radius: Optional[float] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - n_neighbors = hyperparams.Bounded[int]( - default=5, - lower=0, - upper=None, - description='Number of neighbors to use by default for :meth:`k_neighbors` queries.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - weights = hyperparams.Enumeration[str]( - values=['uniform', 'distance'], - default='uniform', - description='weight function used in prediction. Possible values: - \'uniform\' : uniform weights. All points in each neighborhood are weighted equally. - \'distance\' : weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away. - [callable] : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - algorithm = hyperparams.Enumeration[str]( - values=['auto', 'ball_tree', 'kd_tree', 'brute'], - default='auto', - description='Algorithm used to compute the nearest neighbors: - \'ball_tree\' will use :class:`BallTree` - \'kd_tree\' will use :class:`KDTree` - \'brute\' will use a brute-force search. - \'auto\' will attempt to decide the most appropriate algorithm based on the values passed to :meth:`fit` method. Note: fitting on sparse input will override the setting of this parameter, using brute force.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - leaf_size = hyperparams.Bounded[int]( - default=30, - lower=0, - upper=None, - description='Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter', 'https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - metric = hyperparams.Enumeration[str]( - values=['euclidean', 'manhattan', 'chebyshev', 'minkowski', 'wminkowski', 'seuclidean', 'mahalanobis'], - default='minkowski', - description='the distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric. See the documentation of the DistanceMetric class for a list of available metrics.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - p = hyperparams.Enumeration[int]( - values=[1, 2], - default=2, - description='Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. 
For arbitrary p, minkowski_distance (l_p) is used.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_jobs = hyperparams.Union( - configuration=OrderedDict({ - 'limit': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'all_cores': hyperparams.Constant( - default=-1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='limit', - description='The number of parallel jobs to run for neighbors search. If ``-1``, then the number of jobs is set to the number of CPU cores. Doesn\'t affect :meth:`fit` method.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKKNeighborsClassifier(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams], - ProbabilisticCompositionalityMixin[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn KNeighborsClassifier - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.K_NEAREST_NEIGHBORS, ], - "name": "sklearn.neighbors.classification.KNeighborsClassifier", - "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION, - "python_path": "d3m.primitives.classification.k_neighbors.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html']}, - "version": "2019.11.13", - "id": "754f7210-a0b7-3b7a-8c98-f43c7b663d28", - "hyperparams_to_tune": ['n_neighbors', 'p'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = KNeighborsClassifier( - n_neighbors=self.hyperparams['n_neighbors'], - weights=self.hyperparams['weights'], - algorithm=self.hyperparams['algorithm'], - leaf_size=self.hyperparams['leaf_size'], - metric=self.hyperparams['metric'], - p=self.hyperparams['p'], - n_jobs=self.hyperparams['n_jobs'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if 
len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - _fit_method=None, - _fit_X=None, - _tree=None, - classes_=None, - _y=None, - outputs_2d_=None, - effective_metric_=None, - effective_metric_params_=None, - radius=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - _fit_method=getattr(self._clf, '_fit_method', None), - _fit_X=getattr(self._clf, '_fit_X', None), - _tree=getattr(self._clf, '_tree', None), - classes_=getattr(self._clf, 'classes_', None), - _y=getattr(self._clf, '_y', None), - outputs_2d_=getattr(self._clf, 'outputs_2d_', None), - effective_metric_=getattr(self._clf, 'effective_metric_', None), - effective_metric_params_=getattr(self._clf, 'effective_metric_params_', None), - radius=getattr(self._clf, 'radius', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf._fit_method = params['_fit_method'] - self._clf._fit_X = params['_fit_X'] - self._clf._tree = params['_tree'] - self._clf.classes_ = params['classes_'] - self._clf._y = params['_y'] - self._clf.outputs_2d_ = params['outputs_2d_'] - self._clf.effective_metric_ = params['effective_metric_'] - self._clf.effective_metric_params_ = params['effective_metric_params_'] - self._clf.radius = params['radius'] - self._input_column_names = 
params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['_fit_method'] is not None: - self._fitted = True - if params['_fit_X'] is not None: - self._fitted = True - if params['_tree'] is not None: - self._fitted = True - if params['classes_'] is not None: - self._fitted = True - if params['_y'] is not None: - self._fitted = True - if params['outputs_2d_'] is not None: - self._fitted = True - if params['effective_metric_'] is not None: - self._fitted = True - if params['effective_metric_params_'] is not None: - self._fitted = True - if params['radius'] is not None: - self._fitted = True - - - - def log_likelihoods(self, *, - outputs: Outputs, - inputs: Inputs, - timeout: float = None, - iterations: int = None) -> CallResult[Sequence[float]]: - inputs = inputs.values # Get ndarray - outputs = outputs.values - return CallResult(numpy.log(self._clf.predict_proba(inputs)[:, outputs])) - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, 
target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKKNeighborsClassifier.__doc__ = KNeighborsClassifier.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKKNeighborsRegressor.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKKNeighborsRegressor.py deleted file mode 100644 index 38b4469..0000000 --- 
a/common-primitives/sklearn-wrap/sklearn_wrap/SKKNeighborsRegressor.py +++ /dev/null @@ -1,475 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.neighbors.regression import KNeighborsRegressor - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - _fit_method: Optional[str] - _fit_X: Optional[ndarray] - _tree: Optional[object] - _y: Optional[ndarray] - effective_metric_: Optional[str] - effective_metric_params_: Optional[Dict] - radius: Optional[float] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - n_neighbors = hyperparams.Bounded[int]( - default=5, - lower=0, - upper=None, - description='Number of neighbors to use by default for :meth:`k_neighbors` queries.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - weights = hyperparams.Enumeration[str]( - values=['uniform', 'distance'], - default='uniform', - description='weight function used in prediction. Possible values: - \'uniform\' : uniform weights. All points in each neighborhood are weighted equally. - \'distance\' : weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away. - [callable] : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights. Uniform weights are used by default.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - algorithm = hyperparams.Enumeration[str]( - values=['auto', 'ball_tree', 'kd_tree', 'brute'], - default='auto', - description='Algorithm used to compute the nearest neighbors: - \'ball_tree\' will use :class:`BallTree` - \'kd_tree\' will use :class:`KDtree` - \'brute\' will use a brute-force search. - \'auto\' will attempt to decide the most appropriate algorithm based on the values passed to :meth:`fit` method. Note: fitting on sparse input will override the setting of this parameter, using brute force.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - leaf_size = hyperparams.Bounded[int]( - default=30, - lower=0, - upper=None, - description='Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. 
The optimal value depends on the nature of the problem.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter', 'https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - metric = hyperparams.Constant( - default='minkowski', - description='the distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric. See the documentation of the DistanceMetric class for a list of available metrics.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - p = hyperparams.Enumeration[int]( - values=[1, 2], - default=2, - description='Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_jobs = hyperparams.Union( - configuration=OrderedDict({ - 'limit': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'all_cores': hyperparams.Constant( - default=-1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='limit', - description='The number of parallel jobs to run for neighbors search. If ``-1``, then the number of jobs is set to the number of CPU cores. Doesn\'t affect :meth:`fit` method.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? 
This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKKNeighborsRegressor(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn KNeighborsRegressor - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.K_NEAREST_NEIGHBORS, ], - "name": "sklearn.neighbors.regression.KNeighborsRegressor", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.k_neighbors.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html']}, - "version": "2019.11.13", - "id": "50b499a5-cef8-3028-8a99-ae553819f855", - "hyperparams_to_tune": ['n_neighbors', 'p'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = KNeighborsRegressor( - n_neighbors=self.hyperparams['n_neighbors'], - weights=self.hyperparams['weights'], - algorithm=self.hyperparams['algorithm'], - leaf_size=self.hyperparams['leaf_size'], - metric=self.hyperparams['metric'], - p=self.hyperparams['p'], - n_jobs=self.hyperparams['n_jobs'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: 
List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - _fit_method=None, - _fit_X=None, - _tree=None, - _y=None, - effective_metric_=None, - effective_metric_params_=None, - radius=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - _fit_method=getattr(self._clf, '_fit_method', None), - _fit_X=getattr(self._clf, '_fit_X', None), - _tree=getattr(self._clf, '_tree', None), - _y=getattr(self._clf, '_y', None), - effective_metric_=getattr(self._clf, 'effective_metric_', None), - effective_metric_params_=getattr(self._clf, 'effective_metric_params_', None), - radius=getattr(self._clf, 'radius', 
None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf._fit_method = params['_fit_method'] - self._clf._fit_X = params['_fit_X'] - self._clf._tree = params['_tree'] - self._clf._y = params['_y'] - self._clf.effective_metric_ = params['effective_metric_'] - self._clf.effective_metric_params_ = params['effective_metric_params_'] - self._clf.radius = params['radius'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['_fit_method'] is not None: - self._fitted = True - if params['_fit_X'] is not None: - self._fitted = True - if params['_tree'] is not None: - self._fitted = True - if params['_y'] is not None: - self._fitted = True - if params['effective_metric_'] is not None: - self._fitted = True - if params['effective_metric_params_'] is not None: - self._fitted = True - if params['radius'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types 
found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKKNeighborsRegressor.__doc__ = KNeighborsRegressor.__doc__ \ No newline at end of file diff --git 
a/common-primitives/sklearn-wrap/sklearn_wrap/SKKernelPCA.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKKernelPCA.py deleted file mode 100644 index 0c7fb57..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKKernelPCA.py +++ /dev/null @@ -1,536 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.decomposition.kernel_pca import KernelPCA - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - lambdas_: Optional[ndarray] - alphas_: Optional[ndarray] - dual_coef_: Optional[ndarray] - X_fit_: Optional[ndarray] - _centerer: Optional[sklearn.base.BaseEstimator] - X_transformed_fit_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - n_components = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=100, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - description='All non-zero components are kept.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Number of components. 
If None, all non-zero components are kept.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - kernel = hyperparams.Choice( - choices={ - 'linear': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'poly': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'degree': hyperparams.Bounded[int]( - default=3, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'gamma': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Constant( - default=1.0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - description='Equals 1/n_features.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'coef0': hyperparams.Constant( - default=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'rbf': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'gamma': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Constant( - default=1.0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - description='Equals 1/n_features.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'sigmoid': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'gamma': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Constant( - default=1.0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - description='Equals 1/n_features.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'coef0': hyperparams.Constant( - default=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'precomputed': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ) - }, - default='rbf', - description='Kernel. Default="linear".', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - fit_inverse_transform = hyperparams.UniformBool( - default=False, - description='Learn the inverse transform for non-precomputed kernels. (i.e. learn to find the pre-image of a point)', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - alpha = hyperparams.Constant( - default=1, - description='Hyperparameter of the ridge regression that learns the inverse transform (when fit_inverse_transform=True).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - eigen_solver = hyperparams.Enumeration[str]( - default='auto', - values=['auto', 'dense', 'arpack'], - description='Select eigensolver to use. 
If n_components is much less than the number of training samples, arpack may be more efficient than the dense eigensolver.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Bounded[float]( - default=0, - lower=0, - upper=None, - description='Convergence tolerance for arpack. If 0, optimal value will be chosen by arpack.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_iter = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=4, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - description='Optimal value is chosen by arpack.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Maximum number of iterations for arpack. If None, optimal value will be chosen by arpack.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - remove_zero_eig = hyperparams.UniformBool( - default=False, - description='If True, then all components with zero eigenvalues are removed, so that the number of components in the output may be < n_components (and sometimes even zero due to numerical instability). When n_components is None, this parameter is ignored and components with zero eigenvalues are removed regardless.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_jobs = hyperparams.Union( - configuration=OrderedDict({ - 'limit': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'all_cores': hyperparams.Constant( - default=-1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='limit', - description='The number of parallel jobs to run. If `-1`, then the number of jobs is set to the number of CPU cores. .. versionadded:: 0.18 copy_X : boolean, default=True If True, input X is copied and stored by the model in the `X_fit_` attribute. If no further changes will be done to X, setting `copy_X=False` saves memory by storing a reference. .. versionadded:: 0.18', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? 
This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in the input dataframe. Setting this to false makes the code ignore return_result and produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKKernelPCA(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn KernelPCA - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.PRINCIPAL_COMPONENT_ANALYSIS, ], - "name": "sklearn.decomposition.kernel_pca.KernelPCA", - "primitive_family": metadata_base.PrimitiveFamily.FEATURE_EXTRACTION, - "python_path": "d3m.primitives.feature_extraction.kernel_pca.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html']}, - "version": "2019.11.13", - "id": "fec6eba2-4a1b-3ea9-a31f-1da371941ede", - "hyperparams_to_tune": ['n_components', 'kernel', 'alpha'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # Instantiate the wrapped sklearn estimator from the primitive's hyper-parameters. - self._clf = KernelPCA( - n_components=self.hyperparams['n_components'], - kernel=self.hyperparams['kernel']['choice'], - degree=self.hyperparams['kernel'].get('degree', 3), - gamma=self.hyperparams['kernel'].get('gamma', None), - coef0=self.hyperparams['kernel'].get('coef0', 1), - fit_inverse_transform=self.hyperparams['fit_inverse_transform'], - alpha=self.hyperparams['alpha'], - eigen_solver=self.hyperparams['eigen_solver'], - tol=self.hyperparams['tol'], - max_iter=self.hyperparams['max_iter'], - remove_zero_eig=self.hyperparams['remove_zero_eig'], - n_jobs=self.hyperparams['n_jobs'], - 
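# The primitive's random_seed is passed through to sklearn so that repeated runs are reproducible: - 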
random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - outputs = self._wrap_predictions(inputs, sk_output) - if len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - lambdas_=None, - alphas_=None, - dual_coef_=None, - X_fit_=None, - _centerer=None, - X_transformed_fit_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - lambdas_=getattr(self._clf, 'lambdas_', None), - alphas_=getattr(self._clf, 'alphas_', None), - dual_coef_=getattr(self._clf, 'dual_coef_', None), - X_fit_=getattr(self._clf, 'X_fit_', None), - _centerer=getattr(self._clf, '_centerer', None), - X_transformed_fit_=getattr(self._clf, 'X_transformed_fit_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.lambdas_ = params['lambdas_'] - self._clf.alphas_ = params['alphas_'] - self._clf.dual_coef_ = params['dual_coef_'] - self._clf.X_fit_ = params['X_fit_'] - self._clf._centerer = params['_centerer'] - self._clf.X_transformed_fit_ = params['X_transformed_fit_'] - 
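# Restore column bookkeeping next; afterwards any non-None fitted attribute re-marks this instance as fitted, so a get_params()/set_params() round-trip preserves trained state without refitting. - 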
self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['lambdas_'] is not None: - self._fitted = True - if params['alphas_'] is not None: - self._fitted = True - if params['dual_coef_'] is not None: - self._fitted = True - if params['X_fit_'] is not None: - self._fitted = True - if params['_centerer'] is not None: - self._fitted = True - if params['X_transformed_fit_'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
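- # (For this unsupervised transformer nothing is removed; only the configured "return_semantic_type" is attached to each output column.)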
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._copy_inputs_metadata(inputs.metadata, self._training_indices, outputs.metadata, self.hyperparams) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _copy_inputs_metadata(cls, inputs_metadata: metadata_base.DataMetadata, input_indices: List[int], - outputs_metadata: metadata_base.DataMetadata, hyperparams): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - target_columns_metadata: List[OrderedDict] = [] - for column_index in input_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - # If outputs has more columns than index, add Attribute Type to all remaining - if outputs_length > len(input_indices): - for column_index in range(len(input_indices), outputs_length): - column_metadata = OrderedDict() - semantic_types = set() - semantic_types.add(hyperparams["return_semantic_type"]) - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = list(semantic_types) - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKKernelPCA.__doc__ = KernelPCA.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKKernelRidge.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKKernelRidge.py deleted file mode 100644 index a8b12ee..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKKernelRidge.py +++ /dev/null @@ -1,491 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import 
sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.kernel_ridge import KernelRidge - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - dual_coef_: Optional[ndarray] - X_fit_: Optional[Union[ndarray, sparse.spmatrix]] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - alpha = hyperparams.Bounded[float]( - default=1, - lower=0, - upper=None, - description='Small positive values of alpha improve the conditioning of the problem and reduce the variance of the estimates. Alpha corresponds to ``(2*C)^-1`` in other linear models such as LogisticRegression or LinearSVC. If an array is passed, penalties are assumed to be specific to the targets. Hence they must correspond in number.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - kernel = hyperparams.Choice( - choices={ - 'linear': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'poly': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'degree': hyperparams.Bounded[float]( - default=3, - lower=0, - upper=None, - description='Degree of the polynomial kernel. Ignored by other kernels.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'gamma': hyperparams.Bounded[float]( - default=0, - lower=0, - upper=None, - description='Gamma parameter for the RBF, laplacian, polynomial, exponential chi2 and sigmoid kernels. Interpretation of the default value is left to the kernel; see the documentation for sklearn.metrics.pairwise. Ignored by other kernels.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'coef0': hyperparams.Bounded[float]( - default=1, - lower=0, - upper=None, - description='Zero coefficient for polynomial and sigmoid kernels. Ignored by other kernels classes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'rbf': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'gamma': hyperparams.Bounded[float]( - default=0, - lower=0, - upper=None, - description='Gamma parameter for the RBF, laplacian, polynomial, exponential chi2 and sigmoid kernels. Interpretation of the default value is left to the kernel; see the documentation for sklearn.metrics.pairwise. Ignored by other kernels.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'sigmoid': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'gamma': hyperparams.Bounded[float]( - default=0, - lower=0, - upper=None, - description='Gamma parameter for the RBF, laplacian, polynomial, exponential chi2 and sigmoid kernels. 
Interpretation of the default value is left to the kernel; see the documentation for sklearn.metrics.pairwise. Ignored by other kernels.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'coef0': hyperparams.Bounded[float]( - default=1, - lower=0, - upper=None, - description='Zero coefficient for polynomial and sigmoid kernels. Ignored by other kernels classes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'additive_chi2': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'chi2': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'gamma': hyperparams.Bounded[float]( - default=0, - lower=0, - upper=None, - description='Gamma parameter for the RBF, laplacian, polynomial, exponential chi2 and sigmoid kernels. Interpretation of the default value is left to the kernel; see the documentation for sklearn.metrics.pairwise. Ignored by other kernels.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'laplacian': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'gamma': hyperparams.Bounded[float]( - default=0, - lower=0, - upper=None, - description='Gamma parameter for the RBF, laplacian, polynomial, exponential chi2 and sigmoid kernels. Interpretation of the default value is left to the kernel; see the documentation for sklearn.metrics.pairwise. Ignored by other kernels.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'cosine': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'precomputed': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ) - }, - default='linear', - description='Kernel mapping used internally. A callable should accept two arguments and the keyword arguments passed to this object as kernel_params, and should return a floating point number.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. 
Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKKernelRidge(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn KernelRidge - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.SUPPORT_VECTOR_MACHINE, ], - "name": "sklearn.kernel_ridge.KernelRidge", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.kernel_ridge.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html']}, - "version": "2019.11.13", - "id": "0fca4b96-d46b-3598-a4a5-bfa428d039fc", - "hyperparams_to_tune": ['alpha', 'kernel'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = KernelRidge( - alpha=self.hyperparams['alpha'], - kernel=self.hyperparams['kernel']['choice'], - degree=self.hyperparams['kernel'].get('degree', 3), - gamma=self.hyperparams['kernel'].get('gamma', 0), 
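- # NOTE: sub-parameters absent for the chosen kernel fall back to the generated defaults (degree=3, gamma=0, coef0=1); sklearn's own KernelRidge defaults gamma to None instead.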
- coef0=self.hyperparams['kernel'].get('coef0', 1), - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - dual_coef_=None, - X_fit_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - dual_coef_=getattr(self._clf, 'dual_coef_', None), - X_fit_=getattr(self._clf, 'X_fit_', None), - input_column_names=self._input_column_names, - 
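# The entries below are column bookkeeping that produce() needs in order to rebuild output metadata after a get_params()/set_params() round-trip: - 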
training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.dual_coef_ = params['dual_coef_'] - self._clf.X_fit_ = params['X_fit_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['dual_coef_'] is not None: - self._fitted = True - if params['X_fit_'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in 
target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKKernelRidge.__doc__ = KernelRidge.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKLars.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKLars.py deleted file mode 100644 index 1136d16..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKLars.py +++ /dev/null @@ -1,460 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.linear_model.least_angle import Lars - - -from d3m.container.numpy import ndarray as 
d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - alphas_: Optional[ndarray] - active_: Optional[Sequence[Any]] - coef_path_: Optional[ndarray] - coef_: Optional[ndarray] - intercept_: Optional[Union[float, ndarray]] - n_iter_: Optional[Union[int, ndarray, None]] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - fit_intercept = hyperparams.UniformBool( - default=True, - description='Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (e.g. data is expected to be already centered).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - normalize = hyperparams.UniformBool( - default=True, - description='This parameter is ignored when ``fit_intercept`` is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use :class:`sklearn.preprocessing.StandardScaler` before calling ``fit`` on an estimator with ``normalize=False``.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - precompute = hyperparams.Union( - configuration=OrderedDict({ - 'bool': hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'auto': hyperparams.Constant( - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='auto', - description='Whether to use a precomputed Gram matrix to speed up calculations. If set to ``\'auto\'`` let us decide. The Gram matrix can also be passed as argument.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - n_nonzero_coefs = hyperparams.Bounded[int]( - default=500, - lower=0, - upper=None, - description='Target number of non-zero coefficients. Use ``np.inf`` for no limit.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - eps = hyperparams.Bounded[float]( - default=numpy.finfo(numpy.float).eps, - lower=0, - upper=None, - description='The machine-precision regularization in the computation of the Cholesky diagonal factors. Increase this for very ill-conditioned systems. Unlike the ``tol`` parameter in some iterative optimization-based algorithms, this parameter does not control the tolerance of the optimization. 
copy_X : boolean, optional, default True If ``True``, X will be copied; else, it may be overwritten.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - fit_path = hyperparams.UniformBool( - default=True, - description='If True the full path is stored in the ``coef_path_`` attribute. If you compute the solution for a large problem or many targets, setting ``fit_path`` to ``False`` will lead to a speedup, especially with a small alpha.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKLars(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn Lars - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.LINEAR_REGRESSION, ], - "name": "sklearn.linear_model.least_angle.Lars", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.lars.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lars.html']}, - "version": "2019.11.13", - "id": "989a40cd-114c-309d-9a94-59d2669d6c94", - "hyperparams_to_tune": ['eps'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = Lars( - fit_intercept=self.hyperparams['fit_intercept'], - normalize=self.hyperparams['normalize'], - precompute=self.hyperparams['precompute'], - n_nonzero_coefs=self.hyperparams['n_nonzero_coefs'], - eps=self.hyperparams['eps'], - fit_path=self.hyperparams['fit_path'], - verbose=_verbose - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = 
self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - alphas_=None, - active_=None, - coef_path_=None, - coef_=None, - intercept_=None, - n_iter_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - alphas_=getattr(self._clf, 'alphas_', None), - active_=getattr(self._clf, 'active_', None), - coef_path_=getattr(self._clf, 'coef_path_', None), - coef_=getattr(self._clf, 'coef_', None), - intercept_=getattr(self._clf, 'intercept_', None), - n_iter_=getattr(self._clf, 'n_iter_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.alphas_ = params['alphas_'] - self._clf.active_ = params['active_'] - self._clf.coef_path_ = params['coef_path_'] - self._clf.coef_ = params['coef_'] - self._clf.intercept_ = params['intercept_'] - self._clf.n_iter_ = params['n_iter_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['alphas_'] is not None: - self._fitted = True - if params['active_'] is not None: - self._fitted = True - if params['coef_path_'] is not None: - self._fitted = True - if params['coef_'] is not None: - self._fitted = True - if params['intercept_'] is not 
None: - self._fitted = True - if params['n_iter_'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
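-            # For example, a target column typed {Attribute, TrueTarget} comes out
-            # of the rewrite below as {Attribute, PredictedTarget} (plus whatever
-            # 'return_semantic_type' names): TrueTarget and SuggestedTarget are
-            # stripped and PredictedTarget is added.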
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKLars.__doc__ = Lars.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKLasso.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKLasso.py deleted file mode 100644 index 028f7f7..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKLasso.py +++ /dev/null @@ -1,474 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.linear_model.coordinate_descent import Lasso - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - 
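All four wrappers deleted in this diff implement the same `SupervisedLearnerPrimitiveBase` contract, so one usage sketch covers them. Below is a minimal sketch, assuming `train_inputs`, `train_outputs`, and `test_inputs` are d3m container DataFrames with populated metadata (the variable names are hypothetical; `Hyperparams.defaults()` and `.replace()` are the standard d3m hyper-parameter API):

```python
# Minimal driver sketch for the SKLasso wrapper below; SKLars, SKLassoCV and
# SKLinearDiscriminantAnalysis follow the same fit/produce flow.
from sklearn_wrap.SKLasso import SKLasso, Hyperparams

# Start from defaults and override a few values; with use_semantic_types=True
# the primitive selects attribute/target columns from metadata instead of
# using every column.
hp = Hyperparams.defaults().replace({'alpha': 0.1, 'use_semantic_types': True})

primitive = SKLasso(hyperparams=hp, random_seed=0)
primitive.set_training_data(inputs=train_inputs, outputs=train_outputs)  # assumed DataFrames
primitive.fit()
predictions = primitive.produce(inputs=test_inputs).value  # produce() returns a CallResult
```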
-class Params(params.Params): - coef_: Optional[ndarray] - intercept_: Optional[Union[float, ndarray]] - n_iter_: Optional[int] - dual_gap_: Optional[float] - l1_ratio: Optional[float] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - alpha = hyperparams.Bounded[float]( - default=1, - lower=0, - upper=None, - description='Constant that multiplies the L1 term. Defaults to 1.0. ``alpha = 0`` is equivalent to an ordinary least square, solved by the :class:`LinearRegression` object. For numerical reasons, using ``alpha = 0`` with the ``Lasso`` object is not advised. Given this, you should use the :class:`LinearRegression` object.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - fit_intercept = hyperparams.UniformBool( - default=True, - description='whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (e.g. data is expected to be already centered).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - normalize = hyperparams.UniformBool( - default=False, - description='This parameter is ignored when ``fit_intercept`` is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use :class:`sklearn.preprocessing.StandardScaler` before calling ``fit`` on an estimator with ``normalize=False``.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - precompute = hyperparams.Union( - configuration=OrderedDict({ - 'bool': hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'auto': hyperparams.Constant( - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='bool', - description='Whether to use a precomputed Gram matrix to speed up calculations. If set to ``\'auto\'`` let us decide. The Gram matrix can also be passed as argument. For sparse input this option is always ``True`` to preserve sparsity. 
copy_X : boolean, optional, default True If ``True``, X will be copied; else, it may be overwritten.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - max_iter = hyperparams.Bounded[int]( - default=1000, - lower=0, - upper=None, - description='The maximum number of iterations', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Bounded[float]( - default=0.0001, - lower=0, - upper=None, - description='The tolerance for the optimization: if the updates are smaller than ``tol``, the optimization code checks the dual gap for optimality and continues until it is smaller than ``tol``.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - warm_start = hyperparams.UniformBool( - default=False, - description='When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - positive = hyperparams.UniformBool( - default=False, - description='When set to ``True``, forces the coefficients to be positive.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - selection = hyperparams.Enumeration[str]( - default='cyclic', - values=['cyclic', 'random'], - description='If set to \'random\', a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to \'random\') often leads to significantly faster convergence especially when tol is higher than 1e-4.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? 
This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKLasso(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn Lasso - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.LASSO, ], - "name": "sklearn.linear_model.coordinate_descent.Lasso", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.lasso.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html']}, - "version": "2019.11.13", - "id": "a7100c7d-8d8e-3f2a-a0ee-b4380383ed6c", - "hyperparams_to_tune": ['alpha', 'max_iter'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = Lasso( - alpha=self.hyperparams['alpha'], - fit_intercept=self.hyperparams['fit_intercept'], - normalize=self.hyperparams['normalize'], - precompute=self.hyperparams['precompute'], - max_iter=self.hyperparams['max_iter'], - tol=self.hyperparams['tol'], - warm_start=self.hyperparams['warm_start'], - positive=self.hyperparams['positive'], - selection=self.hyperparams['selection'], - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - 
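-        # These attributes (together with the ones just above) cache the raw
-        # training data and the selected column indices/metadata, so that fit()
-        # can run lazily on new data and get_params()/set_params() can
-        # round-trip the trained state.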
self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - coef_=None, - intercept_=None, - n_iter_=None, - dual_gap_=None, - l1_ratio=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - coef_=getattr(self._clf, 'coef_', None), - intercept_=getattr(self._clf, 'intercept_', None), - n_iter_=getattr(self._clf, 'n_iter_', None), - dual_gap_=getattr(self._clf, 'dual_gap_', None), - l1_ratio=getattr(self._clf, 'l1_ratio', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - 
target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.coef_ = params['coef_'] - self._clf.intercept_ = params['intercept_'] - self._clf.n_iter_ = params['n_iter_'] - self._clf.dual_gap_ = params['dual_gap_'] - self._clf.l1_ratio = params['l1_ratio'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['coef_'] is not None: - self._fitted = True - if params['intercept_'] is not None: - self._fitted = True - if params['n_iter_'] is not None: - self._fitted = True - if params['dual_gap_'] is not None: - self._fitted = True - if params['l1_ratio'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 
'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKLasso.__doc__ = Lasso.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKLassoCV.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKLassoCV.py deleted file mode 100644 index 5c53829..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKLassoCV.py +++ /dev/null @@ -1,526 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray 
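-# Unlike plain Lasso above, LassoCV fits a regularization path and selects
-# alpha_ by cross-validation, which is why the Params class below also
-# persists alpha_, alphas_ and mse_path_.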
-from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.linear_model.coordinate_descent import LassoCV - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - alpha_: Optional[float] - coef_: Optional[ndarray] - intercept_: Optional[float] - mse_path_: Optional[ndarray] - alphas_: Optional[ndarray] - dual_gap_: Optional[float] - n_iter_: Optional[int] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - eps = hyperparams.Bounded[float]( - default=0.001, - lower=0, - upper=None, - description='Length of the path. ``eps=1e-3`` means that ``alpha_min / alpha_max = 1e-3``.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_alphas = hyperparams.Bounded[int]( - default=100, - lower=0, - upper=None, - description='Number of alphas along the regularization path', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - fit_intercept = hyperparams.UniformBool( - default=True, - description='whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (e.g. data is expected to be already centered).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - normalize = hyperparams.UniformBool( - default=False, - description='This parameter is ignored when ``fit_intercept`` is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use :class:`sklearn.preprocessing.StandardScaler` before calling ``fit`` on an estimator with ``normalize=False``.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - precompute = hyperparams.Union( - configuration=OrderedDict({ - 'auto': hyperparams.Constant( - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'bool': hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='auto', - description='Whether to use a precomputed Gram matrix to speed up calculations. If set to ``\'auto\'`` let us decide. 
The Gram matrix can also be passed as argument.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - max_iter = hyperparams.Bounded[int]( - default=1000, - lower=0, - upper=None, - description='The maximum number of iterations', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Bounded[float]( - default=0.0001, - lower=0, - upper=None, - description='The tolerance for the optimization: if the updates are smaller than ``tol``, the optimization code checks the dual gap for optimality and continues until it is smaller than ``tol``. copy_X : boolean, optional, default True If ``True``, X will be copied; else, it may be overwritten.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - cv = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - default=5, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='int', - description='Determines the cross-validation splitting strategy. Possible inputs for cv are: - None, to use the default 3-fold cross-validation, - integer, to specify the number of folds. - An object to be used as a cross-validation generator. - An iterable yielding train/test splits. For integer/None inputs, :class:`KFold` is used. Refer :ref:`User Guide ` for the various cross-validation strategies that can be used here.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_jobs = hyperparams.Union( - configuration=OrderedDict({ - 'limit': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'all_cores': hyperparams.Constant( - default=-1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='limit', - description='Number of CPUs to use during the cross validation. If ``-1``, use all the CPUs.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - positive = hyperparams.UniformBool( - default=False, - description='If positive, restrict regression coefficients to be positive', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - selection = hyperparams.Enumeration[str]( - default='cyclic', - values=['cyclic', 'random'], - description='If set to \'random\', a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to \'random\') often leads to significantly faster convergence especially when tol is higher than 1e-4.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. 
If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKLassoCV(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn LassoCV - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.LASSO, ], - "name": "sklearn.linear_model.coordinate_descent.LassoCV", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.lasso_cv.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html']}, - "version": "2019.11.13", - "id": "cfd0482b-d639-3d2b-b876-87f25277a088", - "hyperparams_to_tune": ['eps', 'max_iter'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = LassoCV( - eps=self.hyperparams['eps'], - n_alphas=self.hyperparams['n_alphas'], - fit_intercept=self.hyperparams['fit_intercept'], - normalize=self.hyperparams['normalize'], - precompute=self.hyperparams['precompute'], - max_iter=self.hyperparams['max_iter'], - tol=self.hyperparams['tol'], - cv=self.hyperparams['cv'], - n_jobs=self.hyperparams['n_jobs'], - positive=self.hyperparams['positive'], - selection=self.hyperparams['selection'], - verbose=_verbose, - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - 
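-        # At this point the inputs are narrowed to numeric Attribute-typed
-        # columns and the outputs to TrueTarget-typed columns (when
-        # use_semantic_types is set; see _get_columns_to_fit/_get_targets).
-        # The saved indices let produce() rebuild full-width output via
-        # base_utils.combine_columns.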
self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - alpha_=None, - coef_=None, - intercept_=None, - mse_path_=None, - alphas_=None, - dual_gap_=None, - n_iter_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - alpha_=getattr(self._clf, 'alpha_', None), - coef_=getattr(self._clf, 'coef_', None), - intercept_=getattr(self._clf, 'intercept_', None), - mse_path_=getattr(self._clf, 'mse_path_', None), - alphas_=getattr(self._clf, 'alphas_', None), - dual_gap_=getattr(self._clf, 'dual_gap_', None), - n_iter_=getattr(self._clf, 'n_iter_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.alpha_ = params['alpha_'] - self._clf.coef_ = params['coef_'] - self._clf.intercept_ = params['intercept_'] - self._clf.mse_path_ = params['mse_path_'] - self._clf.alphas_ = params['alphas_'] - self._clf.dual_gap_ = params['dual_gap_'] - self._clf.n_iter_ = params['n_iter_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = 
params['target_columns_metadata_'] - - if params['alpha_'] is not None: - self._fitted = True - if params['coef_'] is not None: - self._fitted = True - if params['intercept_'] is not None: - self._fitted = True - if params['mse_path_'] is not None: - self._fitted = True - if params['alphas_'] is not None: - self._fitted = True - if params['dual_gap_'] is not None: - self._fitted = True - if params['n_iter_'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = 
outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKLassoCV.__doc__ = LassoCV.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKLinearDiscriminantAnalysis.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKLinearDiscriminantAnalysis.py deleted file mode 100644 index b574279..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKLinearDiscriminantAnalysis.py +++ /dev/null @@ -1,535 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.discriminant_analysis import LinearDiscriminantAnalysis - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from 
d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - coef_: Optional[ndarray] - intercept_: Optional[ndarray] - covariance_: Optional[ndarray] - explained_variance_ratio_: Optional[ndarray] - means_: Optional[ndarray] - priors_: Optional[ndarray] - scalings_: Optional[ndarray] - xbar_: Optional[ndarray] - classes_: Optional[ndarray] - _max_components: Optional[int] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - solver = hyperparams.Enumeration[str]( - default='svd', - values=['svd', 'lsqr', 'eigen'], - description='Solver to use, possible values: - \'svd\': Singular value decomposition (default). Does not compute the covariance matrix, therefore this solver is recommended for data with a large number of features. - \'lsqr\': Least squares solution, can be combined with shrinkage. - \'eigen\': Eigenvalue decomposition, can be combined with shrinkage.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - shrinkage = hyperparams.Union( - configuration=OrderedDict({ - 'string': hyperparams.Constant( - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'float': hyperparams.Bounded[float]( - default=0, - lower=0, - upper=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Shrinkage parameter, possible values: - None: no shrinkage (default). - \'auto\': automatic shrinkage using the Ledoit-Wolf lemma. - float between 0 and 1: fixed shrinkage parameter. Note that shrinkage works only with \'lsqr\' and \'eigen\' solvers.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_components = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - default=0, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Number of components (< n_classes - 1) for dimensionality reduction.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Bounded[float]( - default=0.0001, - lower=0, - upper=None, - description='Threshold used for rank estimation in SVD solver. .. versionadded:: 0.17', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. 
If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKLinearDiscriminantAnalysis(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams], - ProbabilisticCompositionalityMixin[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn LinearDiscriminantAnalysis - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.LINEAR_DISCRIMINANT_ANALYSIS, ], - "name": "sklearn.discriminant_analysis.LinearDiscriminantAnalysis", - "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION, - "python_path": "d3m.primitives.classification.linear_discriminant_analysis.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html']}, - "version": "2019.11.13", - "id": "a323b46a-6c15-373e-91b4-20efbd65402f", - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = LinearDiscriminantAnalysis( - solver=self.hyperparams['solver'], - shrinkage=self.hyperparams['shrinkage'], - n_components=self.hyperparams['n_components'], - tol=self.hyperparams['tol'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = 
self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warning("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warning("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - coef_=None, - intercept_=None, - covariance_=None, - explained_variance_ratio_=None, - means_=None, - priors_=None, - scalings_=None, - xbar_=None, - classes_=None, - _max_components=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - coef_=getattr(self._clf, 'coef_', None), - intercept_=getattr(self._clf, 'intercept_', None), - covariance_=getattr(self._clf, 'covariance_', None), - explained_variance_ratio_=getattr(self._clf, 'explained_variance_ratio_', None), - means_=getattr(self._clf, 'means_', None), - priors_=getattr(self._clf, 'priors_', None), - scalings_=getattr(self._clf, 'scalings_', None), - xbar_=getattr(self._clf, 'xbar_', None), - classes_=getattr(self._clf, 'classes_', None), - _max_components=getattr(self._clf, '_max_components', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.coef_ = params['coef_'] - self._clf.intercept_ = params['intercept_'] - self._clf.covariance_ = params['covariance_'] - self._clf.explained_variance_ratio_ = params['explained_variance_ratio_'] - self._clf.means_ = params['means_'] - self._clf.priors_ = params['priors_'] - self._clf.scalings_ = params['scalings_'] - self._clf.xbar_ = params['xbar_'] - self._clf.classes_ = params['classes_'] - self._clf._max_components = params['_max_components'] -
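# [Editor's note] The assignments below restore the column bookkeeping captured
# at fit time, so a primitive rehydrated via set_params() can serve produce()
# without refitting; the is-not-None checks further down flip _fitted back to
# True once any learned attribute is present.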
self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['coef_'] is not None: - self._fitted = True - if params['intercept_'] is not None: - self._fitted = True - if params['covariance_'] is not None: - self._fitted = True - if params['explained_variance_ratio_'] is not None: - self._fitted = True - if params['means_'] is not None: - self._fitted = True - if params['priors_'] is not None: - self._fitted = True - if params['scalings_'] is not None: - self._fitted = True - if params['xbar_'] is not None: - self._fitted = True - if params['classes_'] is not None: - self._fitted = True - if params['_max_components'] is not None: - self._fitted = True - - - def log_likelihoods(self, *, - outputs: Outputs, - inputs: Inputs, - timeout: float = None, - iterations: int = None) -> CallResult[Sequence[float]]: - inputs = inputs.iloc[:, self._training_indices] # Select only the columns used at training time. - outputs = outputs.iloc[:, self._target_column_indices] - - if len(inputs.columns) and len(outputs.columns): - # LinearDiscriminantAnalysis does not expose n_outputs_; treat it as a single-output learner. - n_outputs = getattr(self._clf, 'n_outputs_', 1) - - if outputs.shape[1] != n_outputs: - raise exceptions.InvalidArgumentValueError("\"outputs\" argument does not have the correct number of target columns.") - - log_proba = self._clf.predict_log_proba(inputs) - - # Making it always a list, even when only one target. - if n_outputs == 1: - log_proba = [log_proba] - classes = [self._clf.classes_] - else: - classes = self._clf.classes_ - - samples_length = inputs.shape[0] - - log_likelihoods = [] - for k in range(n_outputs): - # We have to map each class to its internal (numerical) index used in the learner. - # This allows "outputs" to contain string classes. - outputs_column = outputs.iloc[:, k] - classes_map = pandas.Series(numpy.arange(len(classes[k])), index=classes[k]) - mapped_outputs_column = outputs_column.map(classes_map) - - # For each target column (column in "outputs"), for each sample (row) we pick the log - likelihood for a given class.
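# [Editor's note] Illustration with hypothetical values: for classes[k] == ['no', 'yes'],
# classes_map is the pandas Series {'no': 0, 'yes': 1}, so string targets ['yes', 'no']
# map to [1, 0], and log_proba[k][numpy.arange(2), [1, 0]] picks, via NumPy integer
# fancy indexing, row 0's log-probability of 'yes' and row 1's log-probability of 'no'.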
- log_likelihoods.append(log_proba[k][numpy.arange(samples_length), mapped_outputs_column]) - - results = d3m_dataframe(dict(enumerate(log_likelihoods)), generate_metadata=True) - results.columns = outputs.columns - - for k in range(n_outputs): - column_metadata = outputs.metadata.query_column(k) - if 'name' in column_metadata: - results.metadata = results.metadata.update_column(k, {'name': column_metadata['name']}) - - else: - results = d3m_dataframe(generate_metadata=True) - - return CallResult(results) - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata,
hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKLinearDiscriminantAnalysis.__doc__ = LinearDiscriminantAnalysis.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKLinearRegression.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKLinearRegression.py deleted file mode 100644 index 62ce474..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKLinearRegression.py +++ /dev/null @@ -1,431 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.linear_model.base import LinearRegression - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from 
d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - coef_: Optional[ndarray] - intercept_: Optional[float] - _residues: Optional[float] - rank_: Optional[int] - singular_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - fit_intercept = hyperparams.UniformBool( - default=True, - description='whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - normalize = hyperparams.UniformBool( - default=True, - description='This parameter is ignored when ``fit_intercept`` is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use :class:`sklearn.preprocessing.StandardScaler` before calling ``fit`` on an estimator with ``normalize=False``.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_jobs = hyperparams.Union( - configuration=OrderedDict({ - 'limit': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'all_cores': hyperparams.Constant( - default=-1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='limit', - description='The number of jobs to use for the computation. This will only provide speedup for n_targets > 1 and sufficient large problems. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary ` for more details.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. 
Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKLinearRegression(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn LinearRegression - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.LINEAR_REGRESSION, ], - "name": "sklearn.linear_model.base.LinearRegression", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.linear.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html']}, - "version": "2019.11.13", - "id": "816cc0f8-8bf4-4d00-830d-272342349577", - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams,
random_seed=random_seed, docker_containers=docker_containers) - - # Instantiate the wrapped sklearn estimator from the primitive's hyper-parameters. - self._clf = LinearRegression( - fit_intercept=self.hyperparams['fit_intercept'], - normalize=self.hyperparams['normalize'], - n_jobs=self.hyperparams['n_jobs'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warning("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warning("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - coef_=None, - intercept_=None, - _residues=None, - rank_=None, - singular_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, -
target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - coef_=getattr(self._clf, 'coef_', None), - intercept_=getattr(self._clf, 'intercept_', None), - _residues=getattr(self._clf, '_residues', None), - rank_=getattr(self._clf, 'rank_', None), - singular_=getattr(self._clf, 'singular_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.coef_ = params['coef_'] - self._clf.intercept_ = params['intercept_'] - self._clf._residues = params['_residues'] - self._clf.rank_ = params['rank_'] - self._clf.singular_ = params['singular_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['coef_'] is not None: - self._fitted = True - if params['intercept_'] is not None: - self._fitted = True - if params['_residues'] is not None: - self._fitted = True - if params['rank_'] is not None: - self._fitted = True - if params['singular_'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if 
len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKLinearRegression.__doc__ = 
LinearRegression.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKLinearSVC.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKLinearSVC.py deleted file mode 100644 index 55bb114..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKLinearSVC.py +++ /dev/null @@ -1,478 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.svm.classes import LinearSVC - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - coef_: Optional[ndarray] - intercept_: Optional[ndarray] - classes_: Optional[ndarray] - n_iter_: Optional[numpy.int32] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - penalty = hyperparams.Enumeration[str]( - values=['l1', 'l2'], - default='l2', - description='Specifies the norm used in the penalization. The \'l2\' penalty is the standard used in SVC. The \'l1\' leads to ``coef_`` vectors that are sparse.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - loss = hyperparams.Enumeration[str]( - values=['hinge', 'squared_hinge'], - default='squared_hinge', - description='Specifies the loss function. \'hinge\' is the standard SVM loss (used e.g. by the SVC class) while \'squared_hinge\' is the square of the hinge loss.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - dual = hyperparams.UniformBool( - default=True, - description='Select the algorithm to either solve the dual or primal optimization problem. Prefer dual=False when n_samples > n_features.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Bounded[float]( - default=0.0001, - lower=0, - upper=None, - description='Tolerance for stopping criteria.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - C = hyperparams.Bounded[float]( - default=1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - description='Penalty parameter C of the error term.' - ) - multi_class = hyperparams.Enumeration[str]( - values=['ovr', 'crammer_singer'], - default='ovr', - description='Determines the multi-class strategy if `y` contains more than two classes. ``"ovr"`` trains n_classes one-vs-rest classifiers, while ``"crammer_singer"`` optimizes a joint objective over all classes. While `crammer_singer` is interesting from a theoretical perspective as it is consistent, it is seldom used in practice as it rarely leads to better accuracy and is more expensive to compute. If ``"crammer_singer"`` is chosen, the options loss, penalty and dual will be ignored. ', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - fit_intercept = hyperparams.UniformBool( - default=True, - description='Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (i.e. data is expected to be already centered).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - intercept_scaling = hyperparams.Hyperparameter[float]( - default=1, - description='When self.fit_intercept is True, instance vector x becomes ``[x, self.intercept_scaling]``, i.e. a "synthetic" feature with constant value equals to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic feature weight Note! the synthetic feature weight is subject to l1/l2 regularization as all other features. To lessen the effect of regularization on synthetic feature weight (and therefore on the intercept) intercept_scaling has to be increased.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - class_weight = hyperparams.Union( - configuration=OrderedDict({ - 'str': hyperparams.Constant( - default='balanced', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Set the parameter C of class i to ``class_weight[i]*C`` for SVC. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))``', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_iter = hyperparams.Bounded[int]( - default=1000, - lower=0, - upper=None, - description='The maximum number of iterations to be run.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input.
If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKLinearSVC(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn LinearSVC - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.SUPPORT_VECTOR_MACHINE, ], - "name": "sklearn.svm.classes.LinearSVC", - "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION, - "python_path": "d3m.primitives.classification.linear_svc.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html']}, - "version": "2019.11.13", - "id": "71749b20-80e9-3a8e-998e-25da5bbc1abc", - "hyperparams_to_tune": ['C'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # Instantiate the wrapped sklearn estimator from the primitive's hyper-parameters. - self._clf = LinearSVC( - penalty=self.hyperparams['penalty'], - loss=self.hyperparams['loss'], - dual=self.hyperparams['dual'], - tol=self.hyperparams['tol'], - C=self.hyperparams['C'], - multi_class=self.hyperparams['multi_class'], - fit_intercept=self.hyperparams['fit_intercept'], - intercept_scaling=self.hyperparams['intercept_scaling'], - class_weight=self.hyperparams['class_weight'], - max_iter=self.hyperparams['max_iter'], - verbose=_verbose, - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names =
self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warning("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warning("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - coef_=None, - intercept_=None, - classes_=None, - n_iter_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - coef_=getattr(self._clf, 'coef_', None), - intercept_=getattr(self._clf, 'intercept_', None), - classes_=getattr(self._clf, 'classes_', None), - n_iter_=getattr(self._clf, 'n_iter_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.coef_ = params['coef_'] - self._clf.intercept_ = params['intercept_'] - self._clf.classes_ = params['classes_'] - self._clf.n_iter_ = params['n_iter_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['coef_'] is not None: - self._fitted = True - if params['intercept_'] is not None: - self._fitted = True - if params['classes_'] is not None: - self._fitted = True - if params['n_iter_'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): -
if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
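# [Editor's note] Illustration: a column tagged TrueTarget and SuggestedTarget
# loses both tags below and gains PredictedTarget plus the configured
# return_semantic_type, so downstream primitives treat the produced column as a
# prediction rather than ground truth.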
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKLinearSVC.__doc__ = LinearSVC.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKLinearSVR.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKLinearSVR.py deleted file mode 100644 index af809b8..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKLinearSVR.py +++ /dev/null @@ -1,452 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.svm.classes import LinearSVR - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - 
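[Editor's note] A minimal usage sketch, not part of the original file: the toy column names and values are hypothetical, and it assumes the d3m core package is installed. It shows how these generated wrappers are typically driven through the d3m primitive interface; with use_semantic_types left at its default of False, all columns of the inputs/outputs frames are used directly:

    from d3m.container import DataFrame as d3m_dataframe

    # Resolve the Hyperparams class from the primitive's metadata and start from defaults.
    hyperparams_class = SKLinearSVR.metadata.query()['primitive_code']['class_type_arguments']['Hyperparams']
    primitive = SKLinearSVR(hyperparams=hyperparams_class.defaults().replace({'C': 10.0}))

    # Tiny hypothetical training frame; generate_metadata=True fills in basic column metadata.
    inputs = d3m_dataframe({'a': [1.0, 2.0, 3.0], 'b': [0.0, 1.0, 0.0]}, generate_metadata=True)
    outputs = d3m_dataframe({'target': [1.5, 3.5, 5.5]}, generate_metadata=True)

    primitive.set_training_data(inputs=inputs, outputs=outputs)
    primitive.fit()
    predictions = primitive.produce(inputs=inputs).value  # CallResult unwraps to a DataFrame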
- -class Params(params.Params): - coef_: Optional[ndarray] - intercept_: Optional[ndarray] - n_iter_: Optional[numpy.int32] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - C = hyperparams.Bounded[float]( - default=1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - description='Penalty parameter C of the error term. The penalty is a squared l2 penalty. The bigger this parameter, the less regularization is used.' - ) - loss = hyperparams.Enumeration[str]( - values=['epsilon_insensitive', 'squared_epsilon_insensitive'], - default='epsilon_insensitive', - description='Specifies the loss function. \'epsilon_insensitive\' is the standard SVR loss while \'squared_epsilon_insensitive\' is the square of the epsilon-insensitive loss.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - epsilon = hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=None, - description='Epsilon parameter in the epsilon-insensitive loss function. Note that the value of this parameter depends on the scale of the target variable y. If unsure, set ``epsilon=0``.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - dual = hyperparams.UniformBool( - default=True, - description='Select the algorithm to either solve the dual or primal optimization problem. Prefer dual=False when n_samples > n_features.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Bounded[float]( - default=0.0001, - lower=0, - upper=None, - description='Tolerance for stopping criteria.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - fit_intercept = hyperparams.UniformBool( - default=True, - description='Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (i.e. data is expected to be already centered).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - intercept_scaling = hyperparams.Bounded[float]( - default=1, - lower=0, - upper=None, - description='When self.fit_intercept is True, instance vector x becomes [x, self.intercept_scaling], i.e. a "synthetic" feature with constant value equals to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic feature weight Note! the synthetic feature weight is subject to l1/l2 regularization as all other features. To lessen the effect of regularization on synthetic feature weight (and therefore on the intercept) intercept_scaling has to be increased.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_iter = hyperparams.Bounded[int]( - default=1000, - lower=0, - upper=None, - description='The maximum number of iterations to be run.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input.
If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking, set this to False.",
-    )
-
-    return_semantic_type = hyperparams.Enumeration[str](
-        values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'],
-        default='https://metadata.datadrivendiscovery.org/types/PredictedTarget',
-        description='Decides what semantic type to attach to generated output',
-        semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter']
-    )
-
-class SKLinearSVR(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]):
-    """
-    Primitive wrapping for sklearn LinearSVR
-    `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html>`_
-
-    """
-
-    __author__ = "JPL MARVIN"
-    metadata = metadata_base.PrimitiveMetadata({
-        "algorithm_types": [metadata_base.PrimitiveAlgorithmType.SUPPORT_VECTOR_MACHINE, ],
-        "name": "sklearn.svm.classes.LinearSVR",
-        "primitive_family": metadata_base.PrimitiveFamily.REGRESSION,
-        "python_path": "d3m.primitives.regression.linear_svr.SKlearn",
-        "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html']},
-        "version": "2019.11.13",
-        "id": "f40ffdc0-1d6d-3234-8fd0-a3e4d7a136a7",
-        "hyperparams_to_tune": ['C'],
-        'installation': [
-            {'type': metadata_base.PrimitiveInstallationType.PIP,
-             'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format(
-                 git_commit=utils.current_git_commit(os.path.dirname(__file__)),
-             ),
-             }]
-    })
-
-    def __init__(self, *,
-                 hyperparams: Hyperparams,
-                 random_seed: int = 0,
-                 docker_containers: Dict[str, DockerContainer] = None,
-                 _verbose: int = 0) -> None:
-
-        super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers)
-
-        # Build the underlying sklearn estimator from the hyper-parameters.
-        self._clf = LinearSVR(
-            C=self.hyperparams['C'],
-            loss=self.hyperparams['loss'],
-            epsilon=self.hyperparams['epsilon'],
-            dual=self.hyperparams['dual'],
-            tol=self.hyperparams['tol'],
-            fit_intercept=self.hyperparams['fit_intercept'],
-            intercept_scaling=self.hyperparams['intercept_scaling'],
-            max_iter=self.hyperparams['max_iter'],
-            verbose=_verbose,
-            random_state=self.random_seed,
-        )
-
-        self._inputs = None
-        self._outputs = None
-        self._training_inputs = None
-        self._training_outputs = None
-        self._target_names = None
-        self._training_indices = None
-        self._target_column_indices = None
-        self._target_columns_metadata: List[OrderedDict] = None
-        self._input_column_names = None
-        self._fitted = False
-        self._new_training_data = False
-
-    def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None:
-        self._inputs = inputs
-        self._outputs = outputs
-        self._fitted = False
-        self._new_training_data = True
-
-    def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]:
-        if self._inputs is None or self._outputs is None:
-            raise ValueError("Missing training data.")
-
-        if not self._new_training_data:
-            return CallResult(None)
-        self._new_training_data = False
-
-        self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams)
-        self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams)
-        self._input_column_names = self._training_inputs.columns
-
-        if len(self._training_indices) > 0 and len(self._target_column_indices) > 0:
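-            # A single-column target arrives as an (n_samples, 1) array; sklearn
-            # regressors expect shape (n_samples,) for a single output, so it is
-            # flattened with numpy.ravel below before fitting.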
-            self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams)
-            sk_training_output = self._training_outputs.values
-
-            shape = sk_training_output.shape
-            if len(shape) == 2 and shape[1] == 1:
-                sk_training_output = numpy.ravel(sk_training_output)
-
-            self._clf.fit(self._training_inputs, sk_training_output)
-            self._fitted = True
-        else:
-            if self.hyperparams['error_on_no_input']:
-                raise RuntimeError("No input columns were selected")
-            self.logger.warning("No input columns were selected")
-
-        return CallResult(None)
-
-
-    def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]:
-        sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams)
-        output = []
-        if len(sk_inputs.columns):
-            try:
-                sk_output = self._clf.predict(sk_inputs)
-            except sklearn.exceptions.NotFittedError as error:
-                raise PrimitiveNotFittedError("Primitive not fitted.") from error
-            # Some estimators (e.g. GaussianProcessRegressor) can predict without
-            # being fitted, so check explicitly instead of relying on the exception.
-            if not self._fitted:
-                raise PrimitiveNotFittedError("Primitive not fitted.")
-            if sparse.issparse(sk_output):
-                sk_output = sk_output.toarray()
-            output = self._wrap_predictions(inputs, sk_output)
-            output.columns = self._target_names
-            output = [output]
-        else:
-            if self.hyperparams['error_on_no_input']:
-                raise RuntimeError("No input columns were selected")
-            self.logger.warning("No input columns were selected")
-        outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'],
-                                             add_index_columns=self.hyperparams['add_index_columns'],
-                                             inputs=inputs, column_indices=self._target_column_indices,
-                                             columns_list=output)
-
-        return CallResult(outputs)
-
-
-    def get_params(self) -> Params:
-        if not self._fitted:
-            return Params(
-                coef_=None,
-                intercept_=None,
-                n_iter_=None,
-                input_column_names=self._input_column_names,
-                training_indices_=self._training_indices,
-                target_names_=self._target_names,
-                target_column_indices_=self._target_column_indices,
-                target_columns_metadata_=self._target_columns_metadata
-            )
-
-        return Params(
-            coef_=getattr(self._clf, 'coef_', None),
-            intercept_=getattr(self._clf, 'intercept_', None),
-            n_iter_=getattr(self._clf, 'n_iter_', None),
-            input_column_names=self._input_column_names,
-            training_indices_=self._training_indices,
-            target_names_=self._target_names,
-            target_column_indices_=self._target_column_indices,
-            target_columns_metadata_=self._target_columns_metadata
-        )
-
-    def set_params(self, *, params: Params) -> None:
-        self._clf.coef_ = params['coef_']
-        self._clf.intercept_ = params['intercept_']
-        self._clf.n_iter_ = params['n_iter_']
-        self._input_column_names = params['input_column_names']
-        self._training_indices = params['training_indices_']
-        self._target_names = params['target_names_']
-        self._target_column_indices = params['target_column_indices_']
-        self._target_columns_metadata = params['target_columns_metadata_']
-
-        if params['coef_'] is not None:
-            self._fitted = True
-        if params['intercept_'] is not None:
-            self._fitted = True
-        if params['n_iter_'] is not None:
-            self._fitted = True
-
-
-    @classmethod
-    def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams):
-        if not hyperparams['use_semantic_types']:
-            return inputs, list(range(len(inputs.columns)))
-
-        inputs_metadata = inputs.metadata
-
-        def can_produce_column(column_index: int) -> bool:
-            return cls._can_produce_column(inputs_metadata, column_index, hyperparams)
-
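-        # base_utils.get_columns_to_use applies the explicit use/exclude column
-        # lists and filters the remaining candidates through can_produce_column
-        # (numeric structural type plus the Attribute semantic type).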
-        columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata,
-                                                                                   use_columns=hyperparams['use_inputs_columns'],
-                                                                                   exclude_columns=hyperparams['exclude_inputs_columns'],
-                                                                                   can_use_column=can_produce_column)
-        return inputs.iloc[:, columns_to_produce], columns_to_produce
-
-    @classmethod
-    def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool:
-        column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index))
-
-        accepted_structural_types = (int, float, numpy.integer, numpy.float64)
-        accepted_semantic_types = set()
-        accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute")
-        if not issubclass(column_metadata['structural_type'], accepted_structural_types):
-            return False
-
-        semantic_types = set(column_metadata.get('semantic_types', []))
-
-        if len(semantic_types) == 0:
-            cls.logger.warning("No semantic types found in column metadata")
-            return False
-        # Making sure all accepted_semantic_types are available in semantic_types
-        if len(accepted_semantic_types - semantic_types) == 0:
-            return True
-
-        return False
-
-    @classmethod
-    def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams):
-        if not hyperparams['use_semantic_types']:
-            return data, list(data.columns), list(range(len(data.columns)))
-
-        metadata = data.metadata
-
-        def can_produce_column(column_index: int) -> bool:
-            accepted_semantic_types = set()
-            accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget")
-            column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index))
-            semantic_types = set(column_metadata.get('semantic_types', []))
-            if len(semantic_types) == 0:
-                cls.logger.warning("No semantic types found in column metadata")
-                return False
-            # Making sure all accepted_semantic_types are available in semantic_types
-            if len(accepted_semantic_types - semantic_types) == 0:
-                return True
-            return False
-
-        target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata,
-                                                                                             use_columns=hyperparams['use_outputs_columns'],
-                                                                                             exclude_columns=hyperparams['exclude_outputs_columns'],
-                                                                                             can_use_column=can_produce_column)
-        targets = []
-        if target_column_indices:
-            targets = data.select_columns(target_column_indices)
-        target_column_names = []
-        for idx in target_column_indices:
-            target_column_names.append(data.columns[idx])
-        return targets, target_column_names, target_column_indices
-
-    @classmethod
-    def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]:
-        outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length']
-
-        target_columns_metadata: List[OrderedDict] = []
-        for column_index in range(outputs_length):
-            column_metadata = OrderedDict(outputs_metadata.query_column(column_index))
-
-            # Update semantic types and prepare it for predicted targets.
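-            # TrueTarget/SuggestedTarget markers are removed and PredictedTarget
-            # (plus the configured return_semantic_type) is added, e.g.
-            # {TrueTarget, Attribute} -> {Attribute, PredictedTarget}.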
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKLinearSVR.__doc__ = LinearSVR.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKLogisticRegression.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKLogisticRegression.py deleted file mode 100644 index f5578d7..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKLogisticRegression.py +++ /dev/null @@ -1,582 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.linear_model.logistic import LogisticRegression - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - 
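-# A minimal usage sketch (hypothetical variable names, not part of the generated
-# wrapper): the primitive is normally driven by a d3m pipeline, but it can be
-# exercised directly, assuming train_df, targets_df and test_df are d3m
-# DataFrame containers with populated metadata:
-#
-#     primitive = SKLogisticRegression(hyperparams=Hyperparams.defaults())
-#     primitive.set_training_data(inputs=train_df, outputs=targets_df)
-#     primitive.fit()
-#     predictions = primitive.produce(inputs=test_df).value
-#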
-Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - coef_: Optional[ndarray] - intercept_: Optional[ndarray] - n_iter_: Optional[ndarray] - classes_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - penalty = hyperparams.Choice( - choices={ - 'l1': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'l2': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'none': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'elasticnet': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'l1_ratio': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Uniform( - lower=0, - upper=1, - default=0.001, - lower_inclusive=True, - upper_inclusive=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='float', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ) - }, - default='l2', - description='Used to specify the norm used in the penalization. The \'newton-cg\', \'sag\' and \'lbfgs\' solvers support only l2 penalties.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - dual = hyperparams.UniformBool( - default=False, - description='Dual or primal formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer dual=False when n_samples > n_features.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - fit_intercept = hyperparams.UniformBool( - default=True, - description='Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - intercept_scaling = hyperparams.Hyperparameter[float]( - default=1, - description='Useful only when the solver \'liblinear\' is used and self.fit_intercept is set to True. In this case, x becomes [x, self.intercept_scaling], i.e. a "synthetic" feature with constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes ``intercept_scaling * synthetic_feature_weight``. Note! the synthetic feature weight is subject to l1/l2 regularization as all other features. To lessen the effect of regularization on synthetic feature weight (and therefore on the intercept) intercept_scaling has to be increased.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - class_weight = hyperparams.Union( - configuration=OrderedDict({ - 'str': hyperparams.Constant( - default='balanced', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Weights associated with classes in the form ``{class_label: weight}``. If not given, all classes are supposed to have weight one. 
The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))``. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified. .. versionadded:: 0.17 *class_weight=\'balanced\'* instead of deprecated *class_weight=\'auto\'*.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_iter = hyperparams.Bounded[int]( - default=100, - lower=0, - upper=None, - description='Useful only for the newton-cg, sag and lbfgs solvers. Maximum number of iterations taken for the solvers to converge.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - solver = hyperparams.Enumeration[str]( - values=['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'], - default='liblinear', - description='Algorithm to use in the optimization problem. - For small datasets, \'liblinear\' is a good choice, whereas \'sag\' is faster for large ones. - For multiclass problems, only \'newton-cg\', \'sag\' and \'lbfgs\' handle multinomial loss; \'liblinear\' is limited to one-versus-rest schemes. - \'newton-cg\', \'lbfgs\' and \'sag\' only handle L2 penalty. Note that \'sag\' fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing. .. versionadded:: 0.17 Stochastic Average Gradient descent solver.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Bounded[float]( - default=0.0001, - lower=0, - upper=None, - description='Tolerance for stopping criteria.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - C = hyperparams.Hyperparameter[float]( - default=1.0, - description='Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - multi_class = hyperparams.Enumeration[str]( - values=['ovr', 'multinomial'], - default='ovr', - description='Multiclass option can be either \'ovr\' or \'multinomial\'. If the option chosen is \'ovr\', then a binary problem is fit for each label. Else the loss minimised is the multinomial loss fit across the entire probability distribution. Works only for the \'newton-cg\', \'sag\' and \'lbfgs\' solver. .. versionadded:: 0.18 Stochastic Average Gradient descent solver for \'multinomial\' case.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - warm_start = hyperparams.UniformBool( - default=False, - description='When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. Useless for liblinear solver. .. 
versionadded:: 0.17 *warm_start* to support *lbfgs*, *newton-cg*, *sag* solvers.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_jobs = hyperparams.Union( - configuration=OrderedDict({ - 'limit': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'all_cores': hyperparams.Constant( - default=-1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='limit', - description='Number of CPU cores used during the cross-validation loop. If given a value of -1, all cores are used.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking, set this to False.",
-    )
-
-    return_semantic_type = hyperparams.Enumeration[str](
-        values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'],
-        default='https://metadata.datadrivendiscovery.org/types/PredictedTarget',
-        description='Decides what semantic type to attach to generated output',
-        semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter']
-    )
-
-class SKLogisticRegression(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams],
-                           ProbabilisticCompositionalityMixin[Inputs, Outputs, Params, Hyperparams]):
-    """
-    Primitive wrapping for sklearn LogisticRegression
-    `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html>`_
-
-    """
-
-    __author__ = "JPL MARVIN"
-    metadata = metadata_base.PrimitiveMetadata({
-        "algorithm_types": [metadata_base.PrimitiveAlgorithmType.LOGISTIC_REGRESSION, ],
-        "name": "sklearn.linear_model.logistic.LogisticRegression",
-        "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION,
-        "python_path": "d3m.primitives.classification.logistic_regression.SKlearn",
-        "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html']},
-        "version": "2019.11.13",
-        "id": "b9c81b40-8ed1-3b23-80cf-0d6fe6863962",
-        "hyperparams_to_tune": ['C', 'penalty'],
-        'installation': [
-            {'type': metadata_base.PrimitiveInstallationType.PIP,
-             'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format(
-                 git_commit=utils.current_git_commit(os.path.dirname(__file__)),
-             ),
-             }]
-    })
-
-    def __init__(self, *,
-                 hyperparams: Hyperparams,
-                 random_seed: int = 0,
-                 docker_containers: Dict[str, DockerContainer] = None,
-                 _verbose: int = 0) -> None:
-
-        super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers)
-
-        # Build the underlying sklearn estimator from the hyper-parameters.
-        # l1_ratio is only present in the 'elasticnet' choice configuration; for
-        # all other penalties it must default to None, not to a string.
-        self._clf = LogisticRegression(
-            penalty=self.hyperparams['penalty']['choice'],
-            l1_ratio=self.hyperparams['penalty'].get('l1_ratio', None),
-            dual=self.hyperparams['dual'],
-            fit_intercept=self.hyperparams['fit_intercept'],
-            intercept_scaling=self.hyperparams['intercept_scaling'],
-            class_weight=self.hyperparams['class_weight'],
-            max_iter=self.hyperparams['max_iter'],
-            solver=self.hyperparams['solver'],
-            tol=self.hyperparams['tol'],
-            C=self.hyperparams['C'],
-            multi_class=self.hyperparams['multi_class'],
-            warm_start=self.hyperparams['warm_start'],
-            n_jobs=self.hyperparams['n_jobs'],
-            random_state=self.random_seed,
-            verbose=_verbose
-        )
-
-        self._inputs = None
-        self._outputs = None
-        self._training_inputs = None
-        self._training_outputs = None
-        self._target_names = None
-        self._training_indices = None
-        self._target_column_indices = None
-        self._target_columns_metadata: List[OrderedDict] = None
-        self._input_column_names = None
-        self._fitted = False
-        self._new_training_data = False
-
-    def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None:
-        self._inputs = inputs
-        self._outputs = outputs
-        self._fitted = False
-        self._new_training_data = True
-
-    def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]:
-        if self._inputs is None or self._outputs is None:
-            raise ValueError("Missing training data.")
-
-        if not self._new_training_data:
-            return
CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - coef_=None, - intercept_=None, - n_iter_=None, - classes_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - coef_=getattr(self._clf, 'coef_', None), - intercept_=getattr(self._clf, 'intercept_', None), - n_iter_=getattr(self._clf, 'n_iter_', None), - classes_=getattr(self._clf, 'classes_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.coef_ = params['coef_'] - self._clf.intercept_ = params['intercept_'] - self._clf.n_iter_ = params['n_iter_'] - self._clf.classes_ = params['classes_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['coef_'] 
is not None: - self._fitted = True - if params['intercept_'] is not None: - self._fitted = True - if params['n_iter_'] is not None: - self._fitted = True - if params['classes_'] is not None: - self._fitted = True - - - def log_likelihoods(self, *, - outputs: Outputs, - inputs: Inputs, - timeout: float = None, - iterations: int = None) -> CallResult[Sequence[float]]: - inputs = inputs.iloc[:, self._training_indices] # Get ndarray - outputs = outputs.iloc[:, self._target_column_indices] - - if len(inputs.columns) and len(outputs.columns): - - if outputs.shape[1] != self._clf.n_outputs_: - raise exceptions.InvalidArgumentValueError("\"outputs\" argument does not have the correct number of target columns.") - - log_proba = self._clf.predict_log_proba(inputs) - - # Making it always a list, even when only one target. - if self._clf.n_outputs_ == 1: - log_proba = [log_proba] - classes = [self._clf.classes_] - else: - classes = self._clf.classes_ - - samples_length = inputs.shape[0] - - log_likelihoods = [] - for k in range(self._clf.n_outputs_): - # We have to map each class to its internal (numerical) index used in the learner. - # This allows "outputs" to contain string classes. - outputs_column = outputs.iloc[:, k] - classes_map = pandas.Series(numpy.arange(len(classes[k])), index=classes[k]) - mapped_outputs_column = outputs_column.map(classes_map) - - # For each target column (column in "outputs"), for each sample (row) we pick the log - # likelihood for a given class. - log_likelihoods.append(log_proba[k][numpy.arange(samples_length), mapped_outputs_column]) - - results = d3m_dataframe(dict(enumerate(log_likelihoods)), generate_metadata=True) - results.columns = outputs.columns - - for k in range(self._clf.n_outputs_): - column_metadata = outputs.metadata.query_column(k) - if 'name' in column_metadata: - results.metadata = results.metadata.update_column(k, {'name': column_metadata['name']}) - - else: - results = d3m_dataframe(generate_metadata=True) - - return CallResult(results) - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 
0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
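-            # Besides PredictedTarget, the type configured via the
-            # return_semantic_type hyper-parameter is attached here, so a caller
-            # can mark produced columns as e.g. ConstructedAttribute instead.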
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKLogisticRegression.__doc__ = LogisticRegression.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKMLPClassifier.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKMLPClassifier.py deleted file mode 100644 index c0acbcd..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKMLPClassifier.py +++ /dev/null @@ -1,730 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.neural_network.multilayer_perceptron import MLPClassifier - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import 
pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - classes_: Optional[ndarray] - loss_: Optional[float] - coefs_: Optional[Sequence[Any]] - intercepts_: Optional[Sequence[Any]] - n_iter_: Optional[int] - n_layers_: Optional[int] - n_outputs_: Optional[int] - out_activation_: Optional[str] - _best_coefs: Optional[Sequence[Any]] - _best_intercepts: Optional[Sequence[Any]] - _label_binarizer: Optional[sklearn.preprocessing.LabelBinarizer] - _no_improvement_count: Optional[int] - _random_state: Optional[numpy.random.mtrand.RandomState] - best_validation_score_: Optional[numpy.float64] - loss_curve_: Optional[Sequence[Any]] - t_: Optional[int] - _optimizer: Optional[sklearn.neural_network._stochastic_optimizers.AdamOptimizer] - validation_scores_: Optional[Sequence[Any]] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - hidden_layer_sizes = hyperparams.List( - elements=hyperparams.Bounded(1, None, 100), - default=(100, ), - min_size=1, - max_size=None, - description='The ith element represents the number of neurons in the ith hidden layer.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - activation = hyperparams.Enumeration[str]( - values=['identity', 'logistic', 'tanh', 'relu'], - default='relu', - description='Activation function for the hidden layer. - \'identity\', no-op activation, useful to implement linear bottleneck, returns f(x) = x - \'logistic\', the logistic sigmoid function, returns f(x) = 1 / (1 + exp(-x)). - \'tanh\', the hyperbolic tan function, returns f(x) = tanh(x). - \'relu\', the rectified linear unit function, returns f(x) = max(0, x)', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - solver = hyperparams.Choice( - choices={ - 'lbfgs': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'sgd': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'learning_rate': hyperparams.Enumeration[str]( - values=['constant', 'invscaling', 'adaptive'], - default='constant', - description='Learning rate schedule for weight updates. Only used when solver=’sgd’.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'learning_rate_init': hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.001, - description='The initial learning rate used. It controls the step-size in updating the weights. Only used when solver=’sgd’ or ‘adam’.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'power_t': hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.5, - description='The exponent for inverse scaling learning rate. Only used when solver=’sgd’.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'shuffle': hyperparams.UniformBool( - default=True, - description='Whether to shuffle samples in each iteration. Only used when solver=’sgd’ or ‘adam’.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'momentum': hyperparams.Bounded[float]( - default=0.9, - lower=0, - upper=1, - description='Momentum for gradient descent update. Should be between 0 and 1. 
Only used when solver=’sgd’.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'nesterovs_momentum': hyperparams.UniformBool( - default=True, - description='Whether to use Nesterov’s momentum. Only used when solver=’sgd’ and momentum > 0.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'early_stopping': hyperparams.UniformBool( - default=False, - description='Whether to use early stopping to terminate training when validation score is not improving.If set to true, it will automatically set aside 10% of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'n_iter_no_change': hyperparams.Bounded[int]( - default=10, - lower=1, - upper=None, - description='Maximum number of epochs to not meet tol improvement. Only effective when solver=’sgd’ or ‘adam’.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'adam': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'learning_rate_init': hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.001, - description='The initial learning rate used. It controls the step-size in updating the weights. Only used when solver=’sgd’ or ‘adam’.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'shuffle': hyperparams.UniformBool( - default=True, - description='Whether to shuffle samples in each iteration. Only used when solver=’sgd’ or ‘adam’.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'early_stopping': hyperparams.UniformBool( - default=False, - description='Whether to use early stopping to terminate training when validation score is not improving.If set to true, it will automatically set aside 10% of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'beta_1': hyperparams.Bounded[float]( - default=0.9, - lower=0, - upper=1, - description='Exponential decay rate for estimates of first moment vector in adam, should be in [0, 1).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'beta_2': hyperparams.Bounded[float]( - default=0.999, - lower=0, - upper=1, - description='Exponential decay rate for estimates of second moment vector in adam, should be in [0, 1).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'epsilon': hyperparams.Bounded[float]( - default=1e-08, - lower=0, - upper=None, - description='Value for numerical stability in adam. Only used when solver=’adam’', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'n_iter_no_change': hyperparams.Bounded[int]( - default=10, - lower=1, - upper=None, - description='Maximum number of epochs to not meet tol improvement. Only effective when solver=’sgd’ or ‘adam’.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ) - }, - default='adam', - description='The solver for weight optimization. - \'lbfgs\' is an optimizer in the family of quasi-Newton methods. - \'sgd\' refers to stochastic gradient descent. 
- \'adam\' refers to a stochastic gradient-based optimizer proposed by Kingma, Diederik, and Jimmy Ba Note: The default solver \'adam\' works pretty well on relatively large datasets (with thousands of training samples or more) in terms of both training time and validation score. For small datasets, however, \'lbfgs\' can converge faster and perform better.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - alpha = hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.0001, - description='L2 penalty (regularization term) parameter.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - batch_size = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=16, - description='Size of minibatches for stochastic optimizers. If the solver is ‘lbfgs’, the classifier will not use minibatch', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'auto': hyperparams.Constant( - default='auto', - description='When set to “auto”, batch_size=min(200, n_samples)', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='auto', - description='Size of minibatches for stochastic optimizers. If the solver is \'lbfgs\', the classifier will not use minibatch. When set to "auto", `batch_size=min(200, n_samples)`', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_iter = hyperparams.Bounded[float]( - lower=0, - upper=None, - default=200, - description='Maximum number of iterations. The solver iterates until convergence (determined by \'tol\') or this number of iterations. For stochastic solvers (\'sgd\', \'adam\'), note that this determines the number of epochs (how many times each data point will be used), not the number of gradient steps.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Bounded[float]( - default=0.0001, - lower=0, - upper=None, - description='Tolerance for the optimization. When the loss or score is not improving by at least ``tol`` for ``n_iter_no_change`` consecutive iterations, unless ``learning_rate`` is set to \'adaptive\', convergence is considered to be reached and training stops.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - validation_fraction = hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=None, - description='The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if early_stopping is True', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - warm_start = hyperparams.UniformBool( - default=False, - description='When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. See :term:`the Glossary `.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. 
If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKMLPClassifier(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams], - ProbabilisticCompositionalityMixin[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn MLPClassifier - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.MULTILAYER_PERCEPTRON, ], - "name": "sklearn.neural_network.multilayer_perceptron.MLPClassifier", - "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION, - "python_path": "d3m.primitives.classification.mlp.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html']}, - "version": "2019.11.13", - "id": "89d7ffbd-df5d-352f-a038-311b7d379cd0", - "hyperparams_to_tune": ['hidden_layer_sizes', 'activation', 'solver', 'alpha'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: bool = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = MLPClassifier( - hidden_layer_sizes=self.hyperparams['hidden_layer_sizes'], - activation=self.hyperparams['activation'], - solver=self.hyperparams['solver']['choice'], - learning_rate=self.hyperparams['solver'].get('learning_rate', 'constant'), - learning_rate_init=self.hyperparams['solver'].get('learning_rate_init', 0.001), - power_t=self.hyperparams['solver'].get('power_t', 0.5), - shuffle=self.hyperparams['solver'].get('shuffle', True), - momentum=self.hyperparams['solver'].get('momentum', 0.9), - nesterovs_momentum=self.hyperparams['solver'].get('nesterovs_momentum', True), - early_stopping=self.hyperparams['solver'].get('early_stopping', False), - beta_1=self.hyperparams['solver'].get('beta_1', 0.9), - beta_2=self.hyperparams['solver'].get('beta_2', 0.999), - epsilon=self.hyperparams['solver'].get('epsilon', 1e-08), - n_iter_no_change=self.hyperparams['solver'].get('n_iter_no_change', 10), - alpha=self.hyperparams['alpha'], - batch_size=self.hyperparams['batch_size'], - max_iter=self.hyperparams['max_iter'], - tol=self.hyperparams['tol'], - validation_fraction=self.hyperparams['validation_fraction'], - warm_start=self.hyperparams['warm_start'], - random_state=self.random_seed, - verbose=_verbose - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - 
self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - classes_=None, - loss_=None, - coefs_=None, - intercepts_=None, - n_iter_=None, - n_layers_=None, - n_outputs_=None, - out_activation_=None, - _best_coefs=None, - _best_intercepts=None, - _label_binarizer=None, - _no_improvement_count=None, - _random_state=None, - best_validation_score_=None, - loss_curve_=None, - t_=None, - _optimizer=None, - validation_scores_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - classes_=getattr(self._clf, 'classes_', None), - 
loss_=getattr(self._clf, 'loss_', None), - coefs_=getattr(self._clf, 'coefs_', None), - intercepts_=getattr(self._clf, 'intercepts_', None), - n_iter_=getattr(self._clf, 'n_iter_', None), - n_layers_=getattr(self._clf, 'n_layers_', None), - n_outputs_=getattr(self._clf, 'n_outputs_', None), - out_activation_=getattr(self._clf, 'out_activation_', None), - _best_coefs=getattr(self._clf, '_best_coefs', None), - _best_intercepts=getattr(self._clf, '_best_intercepts', None), - _label_binarizer=getattr(self._clf, '_label_binarizer', None), - _no_improvement_count=getattr(self._clf, '_no_improvement_count', None), - _random_state=getattr(self._clf, '_random_state', None), - best_validation_score_=getattr(self._clf, 'best_validation_score_', None), - loss_curve_=getattr(self._clf, 'loss_curve_', None), - t_=getattr(self._clf, 't_', None), - _optimizer=getattr(self._clf, '_optimizer', None), - validation_scores_=getattr(self._clf, 'validation_scores_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.classes_ = params['classes_'] - self._clf.loss_ = params['loss_'] - self._clf.coefs_ = params['coefs_'] - self._clf.intercepts_ = params['intercepts_'] - self._clf.n_iter_ = params['n_iter_'] - self._clf.n_layers_ = params['n_layers_'] - self._clf.n_outputs_ = params['n_outputs_'] - self._clf.out_activation_ = params['out_activation_'] - self._clf._best_coefs = params['_best_coefs'] - self._clf._best_intercepts = params['_best_intercepts'] - self._clf._label_binarizer = params['_label_binarizer'] - self._clf._no_improvement_count = params['_no_improvement_count'] - self._clf._random_state = params['_random_state'] - self._clf.best_validation_score_ = params['best_validation_score_'] - self._clf.loss_curve_ = params['loss_curve_'] - self._clf.t_ = params['t_'] - self._clf._optimizer = params['_optimizer'] - self._clf.validation_scores_ = params['validation_scores_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['classes_'] is not None: - self._fitted = True - if params['loss_'] is not None: - self._fitted = True - if params['coefs_'] is not None: - self._fitted = True - if params['intercepts_'] is not None: - self._fitted = True - if params['n_iter_'] is not None: - self._fitted = True - if params['n_layers_'] is not None: - self._fitted = True - if params['n_outputs_'] is not None: - self._fitted = True - if params['out_activation_'] is not None: - self._fitted = True - if params['_best_coefs'] is not None: - self._fitted = True - if params['_best_intercepts'] is not None: - self._fitted = True - if params['_label_binarizer'] is not None: - self._fitted = True - if params['_no_improvement_count'] is not None: - self._fitted = True - if params['_random_state'] is not None: - self._fitted = True - if params['best_validation_score_'] is not None: - self._fitted = True - if params['loss_curve_'] is not None: - self._fitted = True - if params['t_'] is not None: - self._fitted = True - if params['_optimizer'] is not None: - self._fitted = True - if params['validation_scores_'] is not None: - 
self._fitted = True - - - def log_likelihoods(self, *, - outputs: Outputs, - inputs: Inputs, - timeout: float = None, - iterations: int = None) -> CallResult[Sequence[float]]: - inputs = inputs.iloc[:, self._training_indices] # Get ndarray - outputs = outputs.iloc[:, self._target_column_indices] - - if len(inputs.columns) and len(outputs.columns): - - if outputs.shape[1] != self._clf.n_outputs_: - raise exceptions.InvalidArgumentValueError("\"outputs\" argument does not have the correct number of target columns.") - - log_proba = self._clf.predict_log_proba(inputs) - - # Making it always a list, even when only one target. - if self._clf.n_outputs_ == 1: - log_proba = [log_proba] - classes = [self._clf.classes_] - else: - classes = self._clf.classes_ - - samples_length = inputs.shape[0] - - log_likelihoods = [] - for k in range(self._clf.n_outputs_): - # We have to map each class to its internal (numerical) index used in the learner. - # This allows "outputs" to contain string classes. - outputs_column = outputs.iloc[:, k] - classes_map = pandas.Series(numpy.arange(len(classes[k])), index=classes[k]) - mapped_outputs_column = outputs_column.map(classes_map) - - # For each target column (column in "outputs"), for each sample (row) we pick the log - # likelihood for a given class. - log_likelihoods.append(log_proba[k][numpy.arange(samples_length), mapped_outputs_column]) - - results = d3m_dataframe(dict(enumerate(log_likelihoods)), generate_metadata=True) - results.columns = outputs.columns - - for k in range(self._clf.n_outputs_): - column_metadata = outputs.metadata.query_column(k) - if 'name' in column_metadata: - results.metadata = results.metadata.update_column(k, {'name': column_metadata['name']}) - - else: - results = d3m_dataframe(generate_metadata=True) - - return CallResult(results) - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, 
list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - 
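Aside: `_get_target_columns_metadata` above rewrites each target column's semantic types with plain set arithmetic, stripping training-target markers and attaching the configured output type (PredictedTarget by default). A standalone sketch, with the sample input types assumed:

```python
# Standalone sketch of the set arithmetic in _get_target_columns_metadata:
# training-target markers are removed and the configured return type added.
semantic_types = {
    'https://metadata.datadrivendiscovery.org/types/TrueTarget',
    'https://metadata.datadrivendiscovery.org/types/Target',
}
semantic_types_to_remove = {
    'https://metadata.datadrivendiscovery.org/types/TrueTarget',
    'https://metadata.datadrivendiscovery.org/types/SuggestedTarget',
}
add_semantic_types = {'https://metadata.datadrivendiscovery.org/types/PredictedTarget'}

semantic_types = (semantic_types - semantic_types_to_remove) | add_semantic_types
print(sorted(semantic_types))
# ['https://metadata.datadrivendiscovery.org/types/PredictedTarget',
#  'https://metadata.datadrivendiscovery.org/types/Target']
```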
semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKMLPClassifier.__doc__ = MLPClassifier.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKMLPRegressor.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKMLPRegressor.py deleted file mode 100644 index df6b0e9..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKMLPRegressor.py +++ /dev/null @@ -1,669 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.neural_network.multilayer_perceptron import MLPRegressor - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - loss_: Optional[float] - coefs_: Optional[Sequence[Any]] - intercepts_: Optional[Sequence[Any]] - n_iter_: Optional[int] - n_layers_: Optional[int] - n_outputs_: Optional[int] - out_activation_: Optional[str] - _best_coefs: Optional[Sequence[Any]] - _best_intercepts: Optional[Sequence[Any]] - _no_improvement_count: Optional[int] - _random_state: Optional[numpy.random.mtrand.RandomState] - best_validation_score_: Optional[numpy.float64] - loss_curve_: Optional[Sequence[Any]] - t_: Optional[int] - _optimizer: Optional[sklearn.neural_network._stochastic_optimizers.AdamOptimizer] - validation_scores_: Optional[Sequence[Any]] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - hidden_layer_sizes = hyperparams.List( - elements=hyperparams.Bounded(1, None, 100), - default=(100, ), - min_size=1, - max_size=None, - description='The ith element represents the number of neurons in the ith hidden layer.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - activation = hyperparams.Enumeration[str]( - values=['identity', 'logistic', 'tanh', 'relu'], - default='relu', - description='Activation function for the hidden layer. - \'identity\', no-op activation, useful to implement linear bottleneck, returns f(x) = x - \'logistic\', the logistic sigmoid function, returns f(x) = 1 / (1 + exp(-x)). - \'tanh\', the hyperbolic tan function, returns f(x) = tanh(x). 
- \'relu\', the rectified linear unit function, returns f(x) = max(0, x)', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - solver = hyperparams.Choice( - choices={ - 'lbfgs': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'sgd': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'learning_rate': hyperparams.Enumeration[str]( - values=['constant', 'invscaling', 'adaptive'], - default='constant', - description='Learning rate schedule for weight updates. Only used when solver=’sgd’.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'learning_rate_init': hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.001, - description='The initial learning rate used. It controls the step-size in updating the weights. Only used when solver=’sgd’ or ‘adam’.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'power_t': hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.5, - description='The exponent for inverse scaling learning rate. Only used when solver=’sgd’.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'shuffle': hyperparams.UniformBool( - default=True, - description='Whether to shuffle samples in each iteration. Only used when solver=’sgd’ or ‘adam’.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'momentum': hyperparams.Bounded[float]( - default=0.9, - lower=0, - upper=1, - description='Momentum for gradient descent update. Should be between 0 and 1. Only used when solver=’sgd’.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'nesterovs_momentum': hyperparams.UniformBool( - default=True, - description='Whether to use Nesterov’s momentum. Only used when solver=’sgd’ and momentum > 0.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'early_stopping': hyperparams.UniformBool( - default=False, - description='Whether to use early stopping to terminate training when validation score is not improving.If set to true, it will automatically set aside 10% of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'n_iter_no_change': hyperparams.Bounded[int]( - default=10, - lower=1, - upper=None, - description='Maximum number of epochs to not meet tol improvement. Only effective when solver=’sgd’ or ‘adam’.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'adam': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'learning_rate_init': hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.001, - description='The initial learning rate used. It controls the step-size in updating the weights. Only used when solver=’sgd’ or ‘adam’.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'shuffle': hyperparams.UniformBool( - default=True, - description='Whether to shuffle samples in each iteration. 
Only used when solver=’sgd’ or ‘adam’.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'early_stopping': hyperparams.UniformBool( - default=False, - description='Whether to use early stopping to terminate training when validation score is not improving.If set to true, it will automatically set aside 10% of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'beta_1': hyperparams.Bounded[float]( - default=0.9, - lower=0, - upper=1, - description='Exponential decay rate for estimates of first moment vector in adam, should be in [0, 1).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'beta_2': hyperparams.Bounded[float]( - default=0.999, - lower=0, - upper=1, - description='Exponential decay rate for estimates of second moment vector in adam, should be in [0, 1).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'epsilon': hyperparams.Bounded[float]( - default=1e-08, - lower=0, - upper=None, - description='Value for numerical stability in adam. Only used when solver=’adam’', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'n_iter_no_change': hyperparams.Bounded[int]( - default=10, - lower=1, - upper=None, - description='Maximum number of epochs to not meet tol improvement. Only effective when solver=’sgd’ or ‘adam’.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ) - }, - default='adam', - description='The solver for weight optimization. - \'lbfgs\' is an optimizer in the family of quasi-Newton methods. - \'sgd\' refers to stochastic gradient descent. - \'adam\' refers to a stochastic gradient-based optimizer proposed by Kingma, Diederik, and Jimmy Ba Note: The default solver \'adam\' works pretty well on relatively large datasets (with thousands of training samples or more) in terms of both training time and validation score. For small datasets, however, \'lbfgs\' can converge faster and perform better.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - alpha = hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.0001, - description='L2 penalty (regularization term) parameter.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - batch_size = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=16, - description='Size of minibatches for stochastic optimizers. If the solver is \'lbfgs\', the classifier will not use minibatch', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'auto': hyperparams.Constant( - default='auto', - description='When set to \'auto\', batch_size=min(200, n_samples)', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='auto', - description='Size of minibatches for stochastic optimizers. If the solver is \'lbfgs\', the classifier will not use minibatch. When set to "auto", `batch_size=min(200, n_samples)`', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_iter = hyperparams.Bounded[float]( - lower=0, - upper=None, - default=200, - description='Maximum number of iterations. 
The solver iterates until convergence (determined by \'tol\') or this number of iterations. For stochastic solvers (\'sgd\', \'adam\'), note that this determines the number of epochs (how many times each data point will be used), not the number of gradient steps.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Bounded[float]( - default=0.0001, - lower=0, - upper=None, - description='Tolerance for the optimization. When the loss or score is not improving by at least ``tol`` for ``n_iter_no_change`` consecutive iterations, unless ``learning_rate`` is set to \'adaptive\', convergence is considered to be reached and training stops.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - warm_start = hyperparams.UniformBool( - default=False, - description='When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. See :term:`the Glossary `.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - validation_fraction = hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=None, - description='The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if early_stopping is True', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. 
Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKMLPRegressor(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn MLPRegressor - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.MULTILAYER_PERCEPTRON, ], - "name": "sklearn.neural_network.multilayer_perceptron.MLPRegressor", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.mlp.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html']}, - "version": "2019.11.13", - "id": "a4fedbf8-f69a-3440-9423-559291dfbd61", - "hyperparams_to_tune": ['hidden_layer_sizes', 'activation', 'solver', 'alpha'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: bool = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = MLPRegressor( - hidden_layer_sizes=self.hyperparams['hidden_layer_sizes'], - activation=self.hyperparams['activation'], - solver=self.hyperparams['solver']['choice'], - learning_rate=self.hyperparams['solver'].get('learning_rate', 'constant'), - learning_rate_init=self.hyperparams['solver'].get('learning_rate_init', 0.001), - power_t=self.hyperparams['solver'].get('power_t', 0.5), - shuffle=self.hyperparams['solver'].get('shuffle', True), - momentum=self.hyperparams['solver'].get('momentum', 0.9), - nesterovs_momentum=self.hyperparams['solver'].get('nesterovs_momentum', True), - early_stopping=self.hyperparams['solver'].get('early_stopping', False), - beta_1=self.hyperparams['solver'].get('beta_1', 0.9), - beta_2=self.hyperparams['solver'].get('beta_2', 0.999), - epsilon=self.hyperparams['solver'].get('epsilon', 
1e-08), - n_iter_no_change=self.hyperparams['solver'].get('n_iter_no_change', 10), - alpha=self.hyperparams['alpha'], - batch_size=self.hyperparams['batch_size'], - max_iter=self.hyperparams['max_iter'], - tol=self.hyperparams['tol'], - warm_start=self.hyperparams['warm_start'], - validation_fraction=self.hyperparams['validation_fraction'], - random_state=self.random_seed, - verbose=_verbose - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - loss_=None, - coefs_=None, - intercepts_=None, - n_iter_=None, - n_layers_=None, - n_outputs_=None, - out_activation_=None, 
- _best_coefs=None, - _best_intercepts=None, - _no_improvement_count=None, - _random_state=None, - best_validation_score_=None, - loss_curve_=None, - t_=None, - _optimizer=None, - validation_scores_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - loss_=getattr(self._clf, 'loss_', None), - coefs_=getattr(self._clf, 'coefs_', None), - intercepts_=getattr(self._clf, 'intercepts_', None), - n_iter_=getattr(self._clf, 'n_iter_', None), - n_layers_=getattr(self._clf, 'n_layers_', None), - n_outputs_=getattr(self._clf, 'n_outputs_', None), - out_activation_=getattr(self._clf, 'out_activation_', None), - _best_coefs=getattr(self._clf, '_best_coefs', None), - _best_intercepts=getattr(self._clf, '_best_intercepts', None), - _no_improvement_count=getattr(self._clf, '_no_improvement_count', None), - _random_state=getattr(self._clf, '_random_state', None), - best_validation_score_=getattr(self._clf, 'best_validation_score_', None), - loss_curve_=getattr(self._clf, 'loss_curve_', None), - t_=getattr(self._clf, 't_', None), - _optimizer=getattr(self._clf, '_optimizer', None), - validation_scores_=getattr(self._clf, 'validation_scores_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.loss_ = params['loss_'] - self._clf.coefs_ = params['coefs_'] - self._clf.intercepts_ = params['intercepts_'] - self._clf.n_iter_ = params['n_iter_'] - self._clf.n_layers_ = params['n_layers_'] - self._clf.n_outputs_ = params['n_outputs_'] - self._clf.out_activation_ = params['out_activation_'] - self._clf._best_coefs = params['_best_coefs'] - self._clf._best_intercepts = params['_best_intercepts'] - self._clf._no_improvement_count = params['_no_improvement_count'] - self._clf._random_state = params['_random_state'] - self._clf.best_validation_score_ = params['best_validation_score_'] - self._clf.loss_curve_ = params['loss_curve_'] - self._clf.t_ = params['t_'] - self._clf._optimizer = params['_optimizer'] - self._clf.validation_scores_ = params['validation_scores_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['loss_'] is not None: - self._fitted = True - if params['coefs_'] is not None: - self._fitted = True - if params['intercepts_'] is not None: - self._fitted = True - if params['n_iter_'] is not None: - self._fitted = True - if params['n_layers_'] is not None: - self._fitted = True - if params['n_outputs_'] is not None: - self._fitted = True - if params['out_activation_'] is not None: - self._fitted = True - if params['_best_coefs'] is not None: - self._fitted = True - if params['_best_intercepts'] is not None: - self._fitted = True - if params['_no_improvement_count'] is not None: - self._fitted = True - if params['_random_state'] is not None: - self._fitted = True - if params['best_validation_score_'] is not None: - self._fitted = True - if params['loss_curve_'] is not None: - 
self._fitted = True - if params['t_'] is not None: - self._fitted = True - if params['_optimizer'] is not None: - self._fitted = True - if params['validation_scores_'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update 
semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKMLPRegressor.__doc__ = MLPRegressor.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKMaxAbsScaler.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKMaxAbsScaler.py deleted file mode 100644 index 50eaf4d..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKMaxAbsScaler.py +++ /dev/null @@ -1,339 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.preprocessing.data import MaxAbsScaler - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - scale_: Optional[ndarray] 
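Aside on the `Params` pattern these wrappers share: `get_params()` reads each fitted sklearn attribute with `getattr(..., None)`, so an unfitted estimator serializes as all-`None`, and `set_params()` treats any non-`None` value as evidence of fitting. A minimal sketch, using the public sklearn import path rather than the legacy `sklearn.preprocessing.data` one used above:

```python
# Minimal sketch of the getattr-with-default idiom get_params() builds on:
# unfitted sklearn estimators simply lack the trailing-underscore attributes,
# so None gets stored, and set_params() treats any non-None value as fitted.
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
print(getattr(scaler, 'scale_', None))   # None -> round-trips as "not fitted"

scaler.fit([[1.0], [-4.0]])
print(getattr(scaler, 'scale_', None))   # [4.] -> restored state marks it fitted
```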
- max_abs_: Optional[ndarray] - n_samples_seen_: Optional[int] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKMaxAbsScaler(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn MaxAbsScaler - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.FEATURE_SCALING, ], - "name": "sklearn.preprocessing.data.MaxAbsScaler", - "primitive_family": metadata_base.PrimitiveFamily.DATA_PREPROCESSING, - "python_path": "d3m.primitives.data_preprocessing.max_abs_scaler.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html']}, - "version": "2019.11.13", - "id": "64d2ef5d-b221-3033-8342-76d0293fa99c", - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = MaxAbsScaler( - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - outputs = self._wrap_predictions(inputs, sk_output) - if 
len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - scale_=None, - max_abs_=None, - n_samples_seen_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - scale_=getattr(self._clf, 'scale_', None), - max_abs_=getattr(self._clf, 'max_abs_', None), - n_samples_seen_=getattr(self._clf, 'n_samples_seen_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.scale_ = params['scale_'] - self._clf.max_abs_ = params['max_abs_'] - self._clf.n_samples_seen_ = params['n_samples_seen_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['scale_'] is not None: - self._fitted = True - if params['max_abs_'] is not None: - self._fitted = True - if params['n_samples_seen_'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return 
True - - return False - - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._copy_inputs_metadata(inputs.metadata, self._training_indices, outputs.metadata, self.hyperparams) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _copy_inputs_metadata(cls, inputs_metadata: metadata_base.DataMetadata, input_indices: List[int], - outputs_metadata: metadata_base.DataMetadata, hyperparams): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - target_columns_metadata: List[OrderedDict] = [] - for column_index in input_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - # If outputs has more columns than index, add Attribute Type to all remaining - if outputs_length > len(input_indices): - for column_index in range(len(input_indices), outputs_length): - column_metadata = OrderedDict() - semantic_types = set() - semantic_types.add(hyperparams["return_semantic_type"]) - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = list(semantic_types) - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKMaxAbsScaler.__doc__ = MaxAbsScaler.__doc__ \ 
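For reference, a hypothetical end-to-end use of the scaler wrapper above; the module path, `Hyperparams.defaults()`, and the keyword-only calling convention assume the d3m core package and this `sklearn_wrap` distribution are installed:

```python
# Hypothetical usage sketch. With the default use_semantic_types=False every
# column is selected for scaling, and return_result='new' returns just the
# transformed columns wrapped in a CallResult.
import pandas

from d3m.container import DataFrame as d3m_dataframe
from sklearn_wrap.SKMaxAbsScaler import SKMaxAbsScaler, Hyperparams

inputs = d3m_dataframe(pandas.DataFrame({'a': [1.0, -2.0, 4.0]}), generate_metadata=True)

primitive = SKMaxAbsScaler(hyperparams=Hyperparams.defaults())
primitive.set_training_data(inputs=inputs)
primitive.fit()

outputs = primitive.produce(inputs=inputs).value  # CallResult wraps the dataframe
print(outputs)  # each value divided by max(|a|) == 4.0
```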
No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKMinMaxScaler.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKMinMaxScaler.py deleted file mode 100644 index dc8fc78..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKMinMaxScaler.py +++ /dev/null @@ -1,366 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.preprocessing.data import MinMaxScaler - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - min_: Optional[ndarray] - scale_: Optional[ndarray] - data_min_: Optional[ndarray] - data_max_: Optional[ndarray] - data_range_: Optional[ndarray] - n_samples_seen_: Optional[int] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - feature_range = hyperparams.SortedSet( - elements=hyperparams.Hyperparameter[int](0), - default=(0, 1), - min_size=2, - max_size=2, - description='Desired range of transformed data.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. 
Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKMinMaxScaler(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn MinMaxScaler - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.FEATURE_SCALING, ], - "name": "sklearn.preprocessing.data.MinMaxScaler", - "primitive_family": metadata_base.PrimitiveFamily.DATA_PREPROCESSING, - "python_path": "d3m.primitives.data_preprocessing.min_max_scaler.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html']}, - "version": "2019.11.13", - "id": "08d0579d-38da-307b-8b75-6a213ef2972e", - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = MinMaxScaler( - feature_range=self.hyperparams['feature_range'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise 
PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - outputs = self._wrap_predictions(inputs, sk_output) - if len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - min_=None, - scale_=None, - data_min_=None, - data_max_=None, - data_range_=None, - n_samples_seen_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - min_=getattr(self._clf, 'min_', None), - scale_=getattr(self._clf, 'scale_', None), - data_min_=getattr(self._clf, 'data_min_', None), - data_max_=getattr(self._clf, 'data_max_', None), - data_range_=getattr(self._clf, 'data_range_', None), - n_samples_seen_=getattr(self._clf, 'n_samples_seen_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.min_ = params['min_'] - self._clf.scale_ = params['scale_'] - self._clf.data_min_ = params['data_min_'] - self._clf.data_max_ = params['data_max_'] - self._clf.data_range_ = params['data_range_'] - self._clf.n_samples_seen_ = params['n_samples_seen_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['min_'] is not None: - self._fitted = True - if params['scale_'] is not None: - self._fitted = True - if params['data_min_'] is not None: - self._fitted = True - if params['data_max_'] is not None: - self._fitted = True - if params['data_range_'] is not None: - self._fitted = True - if params['n_samples_seen_'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], 
columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = [] - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._copy_inputs_metadata(inputs.metadata, self._training_indices, outputs.metadata, self.hyperparams) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _copy_inputs_metadata(cls, inputs_metadata: metadata_base.DataMetadata, input_indices: List[int], - outputs_metadata: metadata_base.DataMetadata, hyperparams): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - target_columns_metadata: List[OrderedDict] = [] - for column_index in input_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - 
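# Note: semantic_types_to_remove above is always the empty set for this
# scaler, so the subtraction that follows is a no-op; the only change made
# is adding the URI named by the "return_semantic_type" hyper-parameter to
# the column's existing semantic types.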
add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - # If outputs has more columns than index, add Attribute Type to all remaining - if outputs_length > len(input_indices): - for column_index in range(len(input_indices), outputs_length): - column_metadata = OrderedDict() - semantic_types = set() - semantic_types.add(hyperparams["return_semantic_type"]) - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = list(semantic_types) - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKMinMaxScaler.__doc__ = MinMaxScaler.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKMissingIndicator.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKMissingIndicator.py deleted file mode 100644 index 929389f..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKMissingIndicator.py +++ /dev/null @@ -1,373 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.impute import MissingIndicator -from sklearn.impute._base import _get_mask - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - features_: Optional[ndarray] - _n_features: Optional[int] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - missing_values = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Hyperparameter[int]( - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'np.nan': hyperparams.Hyperparameter[float]( - default=numpy.nan, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='np.nan', - description='The placeholder for the missing values. All occurrences of `missing_values` will be indicated (True in the output array), the other values will be marked as False.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - features = hyperparams.Enumeration[str]( - values=['missing-only', 'all'], - default='missing-only', - description='Whether the imputer mask should represent all or a subset of features. - If "missing-only" (default), the imputer mask will only represent features containing missing values during fit time. 
- If "all", the imputer mask will represent all features.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - error_on_new = hyperparams.UniformBool( - default=True, - description='If True (default), transform will raise an error when there are features with missing values in transform that have no missing values in fit. This is applicable only when ``features="missing-only"``.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKMissingIndicator(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn MissingIndicator - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.IMPUTATION, ], - "name": "sklearn.impute.MissingIndicator", - "primitive_family": metadata_base.PrimitiveFamily.DATA_CLEANING, - "python_path": "d3m.primitives.data_cleaning.missing_indicator.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.impute.MissingIndicator.html']}, - "version": "2019.11.13", - "id": "94c5c918-9ad5-3496-8e52-2359056e0120", - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = MissingIndicator( - missing_values=self.hyperparams['missing_values'], - features=self.hyperparams['features'], - error_on_new=self.hyperparams['error_on_new'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices, _ = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use, _ = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.transform(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - if 
sparse.issparse(sk_output):
-                sk_output = sk_output.toarray()
-            target_columns_metadata = self._copy_columns_metadata(inputs.metadata, self._training_indices, self.hyperparams)
-            output = self._wrap_predictions(inputs, sk_output, target_columns_metadata)
-
-            output.columns = [inputs.columns[idx] for idx in range(len(inputs.columns)) if idx in self._training_indices]
-            output = [output]
-        else:
-            if self.hyperparams['error_on_no_input']:
-                raise RuntimeError("No input columns were selected")
-            self.logger.warn("No input columns were selected")
-        _, _, dropped_cols = self._get_columns_to_fit(inputs, self.hyperparams)
-        outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'],
-                                             add_index_columns=self.hyperparams['add_index_columns'],
-                                             inputs=inputs, column_indices=self._training_indices + dropped_cols,
-                                             columns_list=output)
-        return CallResult(outputs)
-
-
-    def get_params(self) -> Params:
-        if not self._fitted:
-            return Params(
-                features_=None,
-                _n_features=None,
-                input_column_names=self._input_column_names,
-                training_indices_=self._training_indices,
-                target_names_=self._target_names,
-                target_column_indices_=self._target_column_indices,
-                target_columns_metadata_=self._target_columns_metadata
-            )
-
-        return Params(
-            features_=getattr(self._clf, 'features_', None),
-            _n_features=getattr(self._clf, '_n_features', None),
-            input_column_names=self._input_column_names,
-            training_indices_=self._training_indices,
-            target_names_=self._target_names,
-            target_column_indices_=self._target_column_indices,
-            target_columns_metadata_=self._target_columns_metadata
-        )
-
-    def set_params(self, *, params: Params) -> None:
-        self._clf.features_ = params['features_']
-        self._clf._n_features = params['_n_features']
-        self._input_column_names = params['input_column_names']
-        self._training_indices = params['training_indices_']
-        self._target_names = params['target_names_']
-        self._target_column_indices = params['target_column_indices_']
-        self._target_columns_metadata = params['target_columns_metadata_']
-
-        if params['features_'] is not None:
-            self._fitted = True
-        if params['_n_features'] is not None:
-            self._fitted = True
-
-
-
-
-
-    @classmethod
-    def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams):
-
-        if not hyperparams['use_semantic_types']:
-            columns_to_produce = list(range(len(inputs.columns)))
-
-        else:
-            inputs_metadata = inputs.metadata
-
-            def can_produce_column(column_index: int) -> bool:
-                return cls._can_produce_column(inputs_metadata, column_index, hyperparams)
-
-            columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata,
-                                                                                       use_columns=hyperparams['use_columns'],
-                                                                                       exclude_columns=hyperparams['exclude_columns'],
-                                                                                       can_use_column=can_produce_column)
-
-        columns_to_drop = cls._get_columns_to_drop(inputs, columns_to_produce, hyperparams)
-        for col in columns_to_drop:
-            columns_to_produce.remove(col)
-
-        return inputs.iloc[:, columns_to_produce], columns_to_produce, columns_to_drop
-
-    @classmethod
-    def _get_columns_to_drop(cls, inputs: Inputs, column_indices: List[int], hyperparams: Hyperparams):
-        """
-        Check which of the given columns contain no missing_values at all.
-        With features="missing-only", such columns are dropped, since their
-        indicator column would be constant (all False).
-        :param inputs:
-        :param column_indices:
-        :return:
-        """
-        columns_to_remove = []
-        if hyperparams['features'] == "missing-only":
-            for _, col in enumerate(column_indices):
-                inp = inputs.iloc[:, [col]].values
-                mask = _get_mask(inp, hyperparams['missing_values'])
-                if not mask.any():
-                    columns_to_remove.append(col)
-        return columns_to_remove
-
-    @classmethod
-    def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool:
-        column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index))
-
-        accepted_structural_types = (int, float, numpy.integer, numpy.float64)
-        accepted_semantic_types = set()
-        accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute")
-        if not issubclass(column_metadata['structural_type'], accepted_structural_types):
-            return False
-
-        semantic_types = set(column_metadata.get('semantic_types', []))
-
-        if len(semantic_types) == 0:
-            cls.logger.warning("No semantic types found in column metadata")
-            return False
-        # Making sure all accepted_semantic_types are available in semantic_types
-        if len(accepted_semantic_types - semantic_types) == 0:
-            return True
-
-        return False
-
-
-    @classmethod
-    def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]:
-        outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length']
-
-        target_columns_metadata: List[OrderedDict] = []
-        for column_index in range(outputs_length):
-            column_metadata = OrderedDict(outputs_metadata.query_column(column_index))
-
-            # Update semantic types and prepare it for predicted targets.
-            semantic_types = set(column_metadata.get('semantic_types', []))
-            semantic_types_to_remove = set([])
-            add_semantic_types = set()
-            add_semantic_types.add(hyperparams["return_semantic_type"])
-            semantic_types = semantic_types - semantic_types_to_remove
-            semantic_types = semantic_types.union(add_semantic_types)
-            column_metadata['semantic_types'] = list(semantic_types)
-
-            target_columns_metadata.append(column_metadata)
-
-        return target_columns_metadata
-
-    @classmethod
-    def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs],
-                                     target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata:
-        outputs_metadata = metadata_base.DataMetadata().generate(value=outputs)
-
-        for column_index, column_metadata in enumerate(target_columns_metadata):
-            column_metadata.pop("structural_type", None)
-            outputs_metadata = outputs_metadata.update_column(column_index, column_metadata)
-
-        return outputs_metadata
-
-    def _wrap_predictions(self, inputs: Inputs, predictions: ndarray, target_columns_metadata) -> Outputs:
-        outputs = d3m_dataframe(predictions, generate_metadata=False)
-        outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata)
-        return outputs
-
-
-
-    @classmethod
-    def _copy_columns_metadata(cls, inputs_metadata: metadata_base.DataMetadata, column_indices, hyperparams) -> List[OrderedDict]:
-        outputs_length = inputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length']
-
-        target_columns_metadata: List[OrderedDict] = []
-        for column_index in column_indices:
-            column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name")
-            column_metadata = OrderedDict(inputs_metadata.query_column(column_index))
-            semantic_types = set(column_metadata.get('semantic_types', []))
-            semantic_types_to_remove = set([])
-            add_semantic_types = set()
-            add_semantic_types.add(hyperparams["return_semantic_type"])
-            semantic_types = semantic_types - semantic_types_to_remove
-            semantic_types = semantic_types.union(add_semantic_types)
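# The union just computed is where "return_semantic_type" ends up in the
# copied column metadata; under the default hyper-parameter value this
# re-asserts the Attribute type that semantic-type-filtered columns
# already carry.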
- column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKMissingIndicator.__doc__ = MissingIndicator.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKMultinomialNB.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKMultinomialNB.py deleted file mode 100644 index b429050..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKMultinomialNB.py +++ /dev/null @@ -1,488 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.naive_bayes import MultinomialNB - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - class_log_prior_: Optional[ndarray] - feature_log_prob_: Optional[ndarray] - class_count_: Optional[ndarray] - feature_count_: Optional[ndarray] - classes_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - alpha = hyperparams.Hyperparameter[float]( - default=1, - description='Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - fit_prior = hyperparams.UniformBool( - default=True, - description='Whether to learn class prior probabilities or not. If false, a uniform prior will be used.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. 
Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKMultinomialNB(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams], - ProbabilisticCompositionalityMixin[Inputs, Outputs, Params, Hyperparams], - ContinueFitMixin[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn MultinomialNB - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.NAIVE_BAYES_CLASSIFIER, ], - "name": "sklearn.naive_bayes.MultinomialNB", - "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION, - "python_path": "d3m.primitives.classification.multinomial_naive_bayes.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html']}, - "version": "2019.11.13", - "id": "adf13b4b-9fe5-38a2-a1ea-d1b1cc342576", - "hyperparams_to_tune": ['alpha', 'fit_prior'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] 
- }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = MultinomialNB( - alpha=self.hyperparams['alpha'], - fit_prior=self.hyperparams['fit_prior'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - def continue_fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._training_inputs is None or self._training_outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.partial_fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives 
that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - class_log_prior_=None, - feature_log_prob_=None, - class_count_=None, - feature_count_=None, - classes_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - class_log_prior_=getattr(self._clf, 'class_log_prior_', None), - feature_log_prob_=getattr(self._clf, 'feature_log_prob_', None), - class_count_=getattr(self._clf, 'class_count_', None), - feature_count_=getattr(self._clf, 'feature_count_', None), - classes_=getattr(self._clf, 'classes_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.class_log_prior_ = params['class_log_prior_'] - self._clf.feature_log_prob_ = params['feature_log_prob_'] - self._clf.class_count_ = params['class_count_'] - self._clf.feature_count_ = params['feature_count_'] - self._clf.classes_ = params['classes_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['class_log_prior_'] is not None: - self._fitted = True - if params['feature_log_prob_'] is not None: - self._fitted = True - if params['class_count_'] is not None: - self._fitted = True - if params['feature_count_'] is not None: - self._fitted = True - if params['classes_'] is not None: - self._fitted = True - - - def log_likelihoods(self, *, - outputs: Outputs, - inputs: Inputs, - timeout: float = None, - iterations: int = None) -> CallResult[Sequence[float]]: - inputs = inputs.iloc[:, self._training_indices] # Get ndarray - outputs = outputs.iloc[:, self._target_column_indices] - - if len(inputs.columns) and len(outputs.columns): - - if outputs.shape[1] != self._clf.n_outputs_: - raise exceptions.InvalidArgumentValueError("\"outputs\" argument does not have the correct number of target columns.") - - log_proba = self._clf.predict_log_proba(inputs) - - # Making it always a list, even when only one target. 
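# A single-output learner returns one 2-D array of per-class log
# probabilities, so it (and classes_) is wrapped in a one-element list to
# keep the per-target indexing below uniform with the multi-output case.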
- if self._clf.n_outputs_ == 1: - log_proba = [log_proba] - classes = [self._clf.classes_] - else: - classes = self._clf.classes_ - - samples_length = inputs.shape[0] - - log_likelihoods = [] - for k in range(self._clf.n_outputs_): - # We have to map each class to its internal (numerical) index used in the learner. - # This allows "outputs" to contain string classes. - outputs_column = outputs.iloc[:, k] - classes_map = pandas.Series(numpy.arange(len(classes[k])), index=classes[k]) - mapped_outputs_column = outputs_column.map(classes_map) - - # For each target column (column in "outputs"), for each sample (row) we pick the log - # likelihood for a given class. - log_likelihoods.append(log_proba[k][numpy.arange(samples_length), mapped_outputs_column]) - - results = d3m_dataframe(dict(enumerate(log_likelihoods)), generate_metadata=True) - results.columns = outputs.columns - - for k in range(self._clf.n_outputs_): - column_metadata = outputs.metadata.query_column(k) - if 'name' in column_metadata: - results.metadata = results.metadata.update_column(k, {'name': column_metadata['name']}) - - else: - results = d3m_dataframe(generate_metadata=True) - - return CallResult(results) - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - 
semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKMultinomialNB.__doc__ = MultinomialNB.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKNearestCentroid.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKNearestCentroid.py deleted file mode 100644 index 
62bc158..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKNearestCentroid.py +++ /dev/null @@ -1,408 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.neighbors.nearest_centroid import NearestCentroid - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - centroids_: Optional[ndarray] - classes_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - metric = hyperparams.Enumeration[str]( - default='euclidean', - values=['euclidean', 'manhattan'], - description='The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.pairwise_distances for its metric parameter. The centroids for the samples corresponding to each class is the point from which the sum of the distances (according to the metric) of all samples that belong to that particular class are minimized. If the "manhattan" metric is provided, this centroid is the median and for all other metrics, the centroid is now set to be the mean.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - shrink_threshold = hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Threshold for shrinking centroids to remove features.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. 
If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKNearestCentroid(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn NearestCentroid - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.NEAREST_CENTROID_CLASSIFIER, ], - "name": "sklearn.neighbors.nearest_centroid.NearestCentroid", - "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION, - "python_path": "d3m.primitives.classification.nearest_centroid.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestCentroid.html']}, - "version": "2019.11.13", - "id": "90e7b335-5af0-35ad-932c-9c771fe84693", - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = NearestCentroid( - metric=self.hyperparams['metric'], - shrink_threshold=self.hyperparams['shrink_threshold'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - 
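# sklearn estimators expect a 1-D y for single-target problems, so a
# (n_samples, 1) column vector is flattened with numpy.ravel before fitting.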
sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - centroids_=None, - classes_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - centroids_=getattr(self._clf, 'centroids_', None), - classes_=getattr(self._clf, 'classes_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.centroids_ = params['centroids_'] - self._clf.classes_ = params['classes_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['centroids_'] is not None: - self._fitted = True - if params['classes_'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: 
metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
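# (Concretely, the lines below strip the training-only markers TrueTarget and
# SuggestedTarget from each target column and add PredictedTarget plus the
# configured "return_semantic_type", so downstream primitives can tell
# predicted columns apart from ground-truth targets.)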
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKNearestCentroid.__doc__ = NearestCentroid.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKNormalizer.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKNormalizer.py deleted file mode 100644 index b358b7c..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKNormalizer.py +++ /dev/null @@ -1,329 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.preprocessing.data import Normalizer - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - 
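# A minimal, self-contained sketch of what the SKNearestCentroid wrapper above
# effectively runs at fit()/produce() time, using plain scikit-learn; the toy
# arrays and printed result are illustrative assumptions, not taken from the
# original file (the wrapper additionally selects columns via metadata and
# wraps predictions in a d3m DataFrame):
import numpy
from sklearn.neighbors import NearestCentroid

X = numpy.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])  # toy attributes
y = numpy.array([[0], [0], [1], [1]])  # single-column target, shape (4, 1)

clf = NearestCentroid(metric='euclidean', shrink_threshold=None)  # hyper-parameters the wrapper forwards
clf.fit(X, numpy.ravel(y))  # fit() flattens an (n, 1) target with numpy.ravel, as above
print(clf.predict(X))  # produce() path: [0 0 1 1]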
training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - norm = hyperparams.Enumeration[str]( - default='l2', - values=['l1', 'l2', 'max'], - description='The norm to use to normalize each non zero sample.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKNormalizer(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn Normalizer - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.DATA_NORMALIZATION, ], - "name": "sklearn.preprocessing.data.Normalizer", - "primitive_family": metadata_base.PrimitiveFamily.DATA_PREPROCESSING, - "python_path": "d3m.primitives.data_preprocessing.normalizer.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html']}, - "version": "2019.11.13", - "id": "980b3a2d-1574-31f3-8326-ddc62f8fc2c3", - "hyperparams_to_tune": ['norm'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = Normalizer( - norm=self.hyperparams['norm'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - outputs = 
self._wrap_predictions(inputs, sk_output) - if len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and 
prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._copy_inputs_metadata(inputs.metadata, self._training_indices, outputs.metadata, self.hyperparams) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _copy_inputs_metadata(cls, inputs_metadata: metadata_base.DataMetadata, input_indices: List[int], - outputs_metadata: metadata_base.DataMetadata, hyperparams): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - target_columns_metadata: List[OrderedDict] = [] - for column_index in input_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - # If outputs has more columns than index, add Attribute Type to all remaining - if outputs_length > len(input_indices): - for column_index in range(len(input_indices), outputs_length): - column_metadata = OrderedDict() - semantic_types = set() - semantic_types.add(hyperparams["return_semantic_type"]) - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = list(semantic_types) - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKNormalizer.__doc__ = Normalizer.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKNystroem.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKNystroem.py deleted file mode 100644 index b92c92f..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKNystroem.py +++ /dev/null @@ -1,522 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import
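# A minimal sketch of the transformation SKNormalizer (completed above)
# applies at produce() time, via the public scikit-learn import path and toy
# data (both are illustrative assumptions; the wrapper itself handles column
# selection and d3m metadata around this call):
import numpy
from sklearn.preprocessing import Normalizer

X = numpy.array([[3.0, 4.0], [1.0, 0.0]])

print(Normalizer(norm='l2').fit_transform(X))   # rows rescaled to unit Euclidean length: [[0.6 0.8] [1. 0.]]
print(Normalizer(norm='l1').fit_transform(X))   # rows rescaled so absolute values sum to 1
print(Normalizer(norm='max').fit_transform(X))  # rows divided by their maximum absolute value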
sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.kernel_approximation import Nystroem - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - components_: Optional[ndarray] - component_indices_: Optional[ndarray] - normalization_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - kernel = hyperparams.Choice( - choices={ - 'rbf': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'gamma': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Constant( - default=0.1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'laplacian': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'gamma': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Constant( - default=0.1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'polynomial': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'gamma': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Constant( - default=0.1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'coef0': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Constant( - default=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'degree': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Constant( - default=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - 
semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'exponential': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'gamma': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Constant( - default=0.1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'chi2': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'gamma': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Constant( - default=0.1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'sigmoid': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'gamma': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Constant( - default=0.1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'coef0': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Constant( - default=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'cosine': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'poly': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'linear': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'additive_chi2': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ) - }, - default='rbf', - description='Kernel map to be approximated. A callable should accept two arguments and the keyword arguments passed to this object as kernel_params, and should return a floating point number.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_components = hyperparams.Bounded[int]( - default=100, - lower=0, - upper=None, - description='Number of features to construct. How many data points will be used to construct the mapping.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. 
If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKNystroem(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn Nystroem - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.KERNEL_METHOD, ], - "name": "sklearn.kernel_approximation.Nystroem", - "primitive_family": metadata_base.PrimitiveFamily.DATA_PREPROCESSING, - "python_path": "d3m.primitives.data_preprocessing.nystroem.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.kernel_approximation.Nystroem.html']}, - "version": "2019.11.13", - "id": "ca3a4357-a49f-31f0-82ed-244b66e29426", - "hyperparams_to_tune": ['kernel'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, 
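# (Note for the constructor just below: a d3m Choice hyper-parameter such as
# 'kernel' evaluates to a dict-like value holding the selected kernel's name
# under 'choice' plus that kernel's own tuning parameters, which is why the
# code reads self.hyperparams['kernel']['choice'] and falls back to a default
# in .get(...) when the chosen kernel does not define a parameter.)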
docker_containers=docker_containers) - - # False - self._clf = Nystroem( - kernel=self.hyperparams['kernel']['choice'], - degree=self.hyperparams['kernel'].get('degree', 'none'), - gamma=self.hyperparams['kernel'].get('gamma', 'none'), - coef0=self.hyperparams['kernel'].get('coef0', 'none'), - n_components=self.hyperparams['n_components'], - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - outputs = self._wrap_predictions(inputs, sk_output) - if len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - components_=None, - component_indices_=None, - normalization_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - components_=getattr(self._clf, 'components_', None), - component_indices_=getattr(self._clf, 'component_indices_', None), - normalization_=getattr(self._clf, 'normalization_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.components_ = params['components_'] - self._clf.component_indices_ = 
params['component_indices_'] - self._clf.normalization_ = params['normalization_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['components_'] is not None: - self._fitted = True - if params['component_indices_'] is not None: - self._fitted = True - if params['normalization_'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._copy_inputs_metadata(inputs.metadata, self._training_indices, outputs.metadata, self.hyperparams) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _copy_inputs_metadata(cls, inputs_metadata: metadata_base.DataMetadata, input_indices: List[int], - outputs_metadata: metadata_base.DataMetadata, hyperparams): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - target_columns_metadata: List[OrderedDict] = [] - for column_index in input_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - # If outputs has more columns than index, add Attribute Type to all remaining - if outputs_length > len(input_indices): - for column_index in range(len(input_indices), outputs_length): - column_metadata = OrderedDict() - semantic_types = set() - semantic_types.add(hyperparams["return_semantic_type"]) - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = list(semantic_types) - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKNystroem.__doc__ = Nystroem.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKOneHotEncoder.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKOneHotEncoder.py deleted file mode 100644 index 536c585..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKOneHotEncoder.py +++ /dev/null @@ -1,420 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import
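# A minimal sketch of the kernel approximation that SKNystroem (completed
# above) wraps, with an assumed toy matrix and illustrative hyper-parameter
# values (gamma=0.5, n_components=10); the wrapper builds the same call from
# its 'kernel' and 'n_components' hyper-parameters and passes random_seed as
# random_state:
import numpy
from sklearn.kernel_approximation import Nystroem

X = numpy.random.RandomState(0).rand(50, 4)  # toy attribute matrix

feature_map = Nystroem(kernel='rbf', gamma=0.5, n_components=10, random_state=0)
X_features = feature_map.fit_transform(X)  # fit() then produce() in the wrapper
print(X_features.shape)  # (50, 10): one column per landmark component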
sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.preprocessing.data import OneHotEncoder -from numpy import float as npfloat - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - _active_features_: Optional[ndarray] - _categorical_features: Optional[Union[str, ndarray]] - _categories: Optional[Sequence[Any]] - _feature_indices_: Optional[ndarray] - _legacy_mode: Optional[bool] - _n_values_: Optional[ndarray] - _n_values: Optional[Union[str, ndarray]] - categories_: Optional[Sequence[Any]] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - n_values = hyperparams.Union( - configuration=OrderedDict({ - 'auto': hyperparams.Constant( - default='auto', - description='Determine value range from training data.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=10, - description='Number of categorical values per feature. Each feature value should be in range(n_values).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'list': hyperparams.List( - default=[], - elements=hyperparams.Hyperparameter[int](1), - description='n_values[i] is the number of categorical values in X[:, i]. Each feature value should be in range(n_values[i]).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='auto', - description='Number of values per feature. - \'auto\' : determine value range from training data. - int : number of categorical values per feature. Each feature value should be in ``range(n_values)`` - array : ``n_values[i]`` is the number of categorical values in ``X[:, i]``. Each feature value should be in ``range(n_values[i])``', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - sparse = hyperparams.UniformBool( - default=True, - description='Will return sparse matrix if set True else will return an array.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - handle_unknown = hyperparams.Enumeration[str]( - values=['error', 'ignore'], - default='error', - description='Whether to raise an error or ignore if a unknown categorical feature is present during transform.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - categories = hyperparams.Constant( - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. 
If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - encode_target_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should it encode also target columns?", - ) - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKOneHotEncoder(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn OneHotEncoder - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.ENCODE_ONE_HOT, ], - "name": "sklearn.preprocessing.data.OneHotEncoder", - "primitive_family": metadata_base.PrimitiveFamily.DATA_TRANSFORMATION, - "python_path": "d3m.primitives.data_transformation.one_hot_encoder.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html']}, - "version": "2019.11.13", - "id": "c977e879-1bf5-3829-b5b0-39b00233aff5", - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: 
Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = OneHotEncoder( - n_values=self.hyperparams['n_values'], - sparse=self.hyperparams['sparse'], - handle_unknown=self.hyperparams['handle_unknown'], - categories=self.hyperparams['categories'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - outputs = self._wrap_predictions(inputs, sk_output) - if len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - _active_features_=None, - _categorical_features=None, - _categories=None, - _feature_indices_=None, - _legacy_mode=None, - _n_values_=None, - _n_values=None, - categories_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - _active_features_=getattr(self._clf, '_active_features_', None), - _categorical_features=getattr(self._clf, '_categorical_features', None), - _categories=getattr(self._clf, '_categories', None), - _feature_indices_=getattr(self._clf, '_feature_indices_', None), - _legacy_mode=getattr(self._clf, '_legacy_mode', None), - _n_values_=getattr(self._clf, '_n_values_', None), - _n_values=getattr(self._clf, '_n_values', None), - 
categories_=getattr(self._clf, 'categories_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf._active_features_ = params['_active_features_'] - self._clf._categorical_features = params['_categorical_features'] - self._clf._categories = params['_categories'] - self._clf._feature_indices_ = params['_feature_indices_'] - self._clf._legacy_mode = params['_legacy_mode'] - self._clf._n_values_ = params['_n_values_'] - self._clf._n_values = params['_n_values'] - self._clf.categories_ = params['categories_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['_active_features_'] is not None: - self._fitted = True - if params['_categorical_features'] is not None: - self._fitted = True - if params['_categories'] is not None: - self._fitted = True - if params['_feature_indices_'] is not None: - self._fitted = True - if params['_legacy_mode'] is not None: - self._fitted = True - if params['_n_values_'] is not None: - self._fitted = True - if params['_n_values'] is not None: - self._fitted = True - if params['categories_'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int,float,numpy.integer,numpy.float64,str,) - accepted_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/CategoricalData","https://metadata.datadrivendiscovery.org/types/Attribute",]) - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - if hyperparams['encode_target_columns'] and 'https://metadata.datadrivendiscovery.org/types/Target' in semantic_types: - return True - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - 
outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._copy_inputs_metadata(inputs.metadata, self._training_indices, outputs.metadata, self.hyperparams) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _copy_inputs_metadata(cls, inputs_metadata: metadata_base.DataMetadata, input_indices: List[int], - outputs_metadata: metadata_base.DataMetadata, hyperparams): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - target_columns_metadata: List[OrderedDict] = [] - for column_index in input_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - # If outputs has more columns than index, add Attribute Type to all remaining - if outputs_length > len(input_indices): - for column_index in range(len(input_indices), outputs_length): - column_metadata = OrderedDict() - semantic_types = set() - semantic_types.add(hyperparams["return_semantic_type"]) - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = list(semantic_types) - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKOneHotEncoder.__doc__ = OneHotEncoder.__doc__ \ No newline at end of file
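A minimal usage sketch of the fit/produce lifecycle these generated wrappers share (hypothetical data; assumes d3m's Hyperparams.defaults() API and the default hyper-parameters, under which all columns are encoded):

    from d3m.container import DataFrame as d3m_dataframe

    encoder = SKOneHotEncoder(hyperparams=Hyperparams.defaults())
    inputs = d3m_dataframe({'color': ['red', 'blue', 'red']}, generate_metadata=True)
    encoder.set_training_data(inputs=inputs)
    encoder.fit()
    encoded = encoder.produce(inputs=inputs).value  # one indicator column per observed category

diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKOrdinalEncoder.py 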
b/common-primitives/sklearn-wrap/sklearn_wrap/SKOrdinalEncoder.py deleted file mode 100644 index 7396073..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKOrdinalEncoder.py +++ /dev/null @@ -1,343 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.preprocessing._encoders import OrdinalEncoder - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - categories_: Optional[Optional[Sequence[Any]]] - _categories: Optional[str] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - categories = hyperparams.Constant( - default='auto', - description='Categories (unique values) per feature: - \'auto\' : Determine categories automatically from the training data. - list : ``categories[i]`` holds the categories expected in the ith column. The passed categories should not mix strings and numeric values, and should be sorted in case of numeric values. The used categories can be found in the ``categories_`` attribute.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. 
Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKOrdinalEncoder(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn OrdinalEncoder - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.CATEGORY_ENCODER, ], - "name": "sklearn.preprocessing._encoders.OrdinalEncoder", - "primitive_family": metadata_base.PrimitiveFamily.DATA_TRANSFORMATION, - "python_path": "d3m.primitives.data_transformation.ordinal_encoder.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html']}, - "version": "2019.11.13", - "id": "a048aaa7-4475-3834-b739-de3105ec7217", - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = OrdinalEncoder( - categories=self.hyperparams['categories'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if 
self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - outputs = self._wrap_predictions(inputs, sk_output) - if len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - categories_=None, - _categories=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - categories_=getattr(self._clf, 'categories_', None), - _categories=getattr(self._clf, '_categories', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.categories_ = params['categories_'] - self._clf._categories = params['_categories'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['categories_'] is not None: - self._fitted = True - if params['_categories'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = 
(int,float,numpy.integer,numpy.float64,str,) - accepted_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/CategoricalData",]) - not_accepted_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/Target",]) - - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - if len(not_accepted_semantic_types.intersection(semantic_types)) > 0: - return False - - # Making sure at least one accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types.intersection(semantic_types)) > 0: - return True - - return False - - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._copy_inputs_metadata(inputs.metadata, self._training_indices, outputs.metadata, self.hyperparams) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _copy_inputs_metadata(cls, inputs_metadata: metadata_base.DataMetadata, input_indices: List[int], - outputs_metadata: metadata_base.DataMetadata, hyperparams): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - target_columns_metadata: List[OrderedDict] = [] - for column_index in input_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove
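# NOTE: this mirrors the arithmetic in _get_target_columns_metadata above:
# nothing is removed by default, and the configured return_semantic_type
# (Attribute by default) is added, e.g. {CategoricalData} -> {CategoricalData, Attribute}.
semantic_types = 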
semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - # If outputs has more columns than index, add Attribute Type to all remaining - if outputs_length > len(input_indices): - for column_index in range(len(input_indices), outputs_length): - column_metadata = OrderedDict() - semantic_types = set() - semantic_types.add(hyperparams["return_semantic_type"]) - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = list(semantic_types) - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKOrdinalEncoder.__doc__ = OrdinalEncoder.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKPCA.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKPCA.py deleted file mode 100644 index a8c7973..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKPCA.py +++ /dev/null @@ -1,468 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.decomposition.pca import PCA -import sys - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - components_: Optional[ndarray] - explained_variance_: Optional[ndarray] - explained_variance_ratio_: Optional[ndarray] - mean_: Optional[ndarray] - n_components_: Optional[int] - noise_variance_: Optional[float] - n_features_: Optional[int] - n_samples_: Optional[int] - singular_values_: Optional[ndarray] - _fit_svd_solver: Optional[str] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - n_components = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=0, - description='Number of components to keep.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'float': hyperparams.Uniform( - lower=0, - upper=1, - default=0.5, - description='Selects the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'mle': hyperparams.Constant( - default='mle', - description='If svd_solver == \'full\', Minka\'s MLE is used to guess the dimension.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - description='All components are kept, n_components == min(n_samples, n_features).', - 
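# NOTE: a Union hyper-parameter resolves to exactly one configuration branch at
# runtime; per the description below, the 'mle' branch is only meaningful when
# svd_solver='full', and the 'float' branch selects by explained-variance ratio.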
semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Number of components to keep. if n_components is not set all components are kept:: n_components == min(n_samples, n_features) if n_components == \'mle\' and svd_solver == \'full\', Minka\'s MLE is used to guess the dimension if ``0 < n_components < 1`` and svd_solver == \'full\', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components n_components cannot be equal to n_features for svd_solver == \'arpack\'.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - whiten = hyperparams.UniformBool( - default=False, - description='When True (False by default) the `components_` vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances. Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometime improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - svd_solver = hyperparams.Choice( - choices={ - 'auto': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'full': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'arpack': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'tol': hyperparams.Bounded[float]( - default=0.0, - lower=0.0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'randomized': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'iterated_power': hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'auto': hyperparams.Constant( - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ) - }, - default='auto', - description='auto : the solver is selected by a default policy based on `X.shape` and `n_components`: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient \'randomized\' method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards. full : run exact full SVD calling the standard LAPACK solver via `scipy.linalg.svd` and select the components by postprocessing arpack : run SVD truncated to n_components calling ARPACK solver via `scipy.sparse.linalg.svds`. It requires strictly 0 < n_components < X.shape[1] randomized : run randomized SVD by the method of Halko et al. .. versionadded:: 0.18.0', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. 
If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKPCA(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn PCA - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.PRINCIPAL_COMPONENT_ANALYSIS, ], - "name": "sklearn.decomposition.pca.PCA", - "primitive_family": metadata_base.PrimitiveFamily.FEATURE_EXTRACTION, - "python_path": "d3m.primitives.feature_extraction.pca.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html']}, - "version": "2019.11.13", - "id": "2fb28cd1-5de6-3663-a2dc-09c786fba7f4", - "hyperparams_to_tune": ['n_components', 'svd_solver'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, 
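# NOTE: self.hyperparams['svd_solver'] is a Choice value: a mapping carrying the
# selected 'choice' plus that branch's own keys, which __init__ flattens with
# .get() fallbacks (tol=0.0, iterated_power='auto') when constructing PCA below.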
docker_containers=docker_containers) - - # False - self._clf = PCA( - n_components=self.hyperparams['n_components'], - whiten=self.hyperparams['whiten'], - svd_solver=self.hyperparams['svd_solver']['choice'], - tol=self.hyperparams['svd_solver'].get('tol', 0.0), - iterated_power=self.hyperparams['svd_solver'].get('iterated_power', 'auto'), - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - outputs = self._wrap_predictions(inputs, sk_output) - if len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - components_=None, - explained_variance_=None, - explained_variance_ratio_=None, - mean_=None, - n_components_=None, - noise_variance_=None, - n_features_=None, - n_samples_=None, - singular_values_=None, - _fit_svd_solver=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - components_=getattr(self._clf, 'components_', None), - explained_variance_=getattr(self._clf, 'explained_variance_', None), - explained_variance_ratio_=getattr(self._clf, 'explained_variance_ratio_', None), - mean_=getattr(self._clf, 'mean_', None), - n_components_=getattr(self._clf, 'n_components_', None), - noise_variance_=getattr(self._clf, 'noise_variance_', None), - n_features_=getattr(self._clf, 'n_features_', None), 
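# NOTE: getattr(..., None) keeps get_params() usable on a partially fitted
# estimator and tolerant of attributes that only exist after fit().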
- n_samples_=getattr(self._clf, 'n_samples_', None), - singular_values_=getattr(self._clf, 'singular_values_', None), - _fit_svd_solver=getattr(self._clf, '_fit_svd_solver', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.components_ = params['components_'] - self._clf.explained_variance_ = params['explained_variance_'] - self._clf.explained_variance_ratio_ = params['explained_variance_ratio_'] - self._clf.mean_ = params['mean_'] - self._clf.n_components_ = params['n_components_'] - self._clf.noise_variance_ = params['noise_variance_'] - self._clf.n_features_ = params['n_features_'] - self._clf.n_samples_ = params['n_samples_'] - self._clf.singular_values_ = params['singular_values_'] - self._clf._fit_svd_solver = params['_fit_svd_solver'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['components_'] is not None: - self._fitted = True - if params['explained_variance_'] is not None: - self._fitted = True - if params['explained_variance_ratio_'] is not None: - self._fitted = True - if params['mean_'] is not None: - self._fitted = True - if params['n_components_'] is not None: - self._fitted = True - if params['noise_variance_'] is not None: - self._fitted = True - if params['n_features_'] is not None: - self._fitted = True - if params['n_samples_'] is not None: - self._fitted = True - if params['singular_values_'] is not None: - self._fitted = True - if params['_fit_svd_solver'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - - @classmethod - def 
_get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._copy_inputs_metadata(inputs.metadata, self._training_indices, outputs.metadata, self.hyperparams) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _copy_inputs_metadata(cls, inputs_metadata: metadata_base.DataMetadata, input_indices: List[int], - outputs_metadata: metadata_base.DataMetadata, hyperparams): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - target_columns_metadata: List[OrderedDict] = [] - for column_index in input_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - # If outputs has more columns than index, add Attribute Type to all remaining - if outputs_length > len(input_indices): - for column_index in range(len(input_indices), outputs_length): - column_metadata = OrderedDict() - semantic_types = set() - semantic_types.add(hyperparams["return_semantic_type"]) - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = list(semantic_types) - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKPCA.__doc__ = PCA.__doc__ \ No newline at end of file
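A short sketch of driving the n_components Union from a pipeline (hypothetical train_df, a numeric d3m DataFrame; assumes d3m's Hyperparams.defaults()/replace() API):

    pca = SKPCA(hyperparams=Hyperparams.defaults().replace({'n_components': 2}), random_seed=0)
    pca.set_training_data(inputs=train_df)
    pca.fit()
    projected = pca.produce(inputs=train_df).value  # two principal-component columns

diff --git 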
a/common-primitives/sklearn-wrap/sklearn_wrap/SKPassiveAggressiveClassifier.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKPassiveAggressiveClassifier.py deleted file mode 100644 index 9a4cfa9..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKPassiveAggressiveClassifier.py +++ /dev/null @@ -1,648 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.linear_model.passive_aggressive import PassiveAggressiveClassifier - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - coef_: Optional[ndarray] - intercept_: Optional[ndarray] - classes_: Optional[ndarray] - _expanded_class_weight: Optional[ndarray] - alpha: Optional[float] - epsilon: Optional[float] - eta0: Optional[float] - l1_ratio: Optional[float] - learning_rate: Optional[str] - loss_function_: Optional[object] - n_iter_: Optional[int] - penalty: Optional[str] - power_t: Optional[float] - t_: Optional[float] - average_coef_: Optional[ndarray] - average_intercept_: Optional[ndarray] - standard_coef_: Optional[ndarray] - standard_intercept_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - C = hyperparams.Bounded[float]( - default=1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - fit_intercept = hyperparams.UniformBool( - default=False, - description='Whether the intercept should be estimated or not. 
If False, the data is assumed to be already centered.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_iter = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=1000, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='int', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - shuffle = hyperparams.UniformBool( - default=True, - description='Whether or not the training data should be shuffled after each epoch.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Bounded[float]( - default=0.001, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='float', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_jobs = hyperparams.Union( - configuration=OrderedDict({ - 'limit': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'all_cores': hyperparams.Constant( - default=-1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='limit', - description='The number of CPUs to use to do the OVA (One Versus All, for multi-class problems) computation. -1 means \'all CPUs\'. Defaults to 1.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - loss = hyperparams.Enumeration[str]( - values=['hinge', 'squared_hinge'], - default='hinge', - description='The loss function to be used: hinge: equivalent to PA-I in the reference paper. squared_hinge: equivalent to PA-II in the reference paper.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - warm_start = hyperparams.UniformBool( - default=False, - description='When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - class_weight = hyperparams.Union( - configuration=OrderedDict({ - 'str': hyperparams.Constant( - default='balanced', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Preset for the class_weight fit parameter. Weights associated with classes. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))`` .. 
versionadded:: 0.17 parameter *class_weight* to automatically weight samples.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - average = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=2, - upper=None, - default=10, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'bool': hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='bool', - description='When set to True, computes the averaged SGD weights and stores the result in the coef_ attribute. If set to an int greater than 1, averaging will begin once the total number of samples seen reaches average. So average=10 will begin averaging after seeing 10 samples. New in version 0.19: parameter average to use weights averaging in SGD', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - early_stopping = hyperparams.UniformBool( - default=False, - description='Whether to use early stopping to terminate training when validation score is not improving. If set to True, it will automatically set aside a fraction of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - validation_fraction = hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=1, - description='The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if early_stopping is True.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_iter_no_change = hyperparams.Bounded[int]( - default=5, - lower=0, - upper=None, - description='Number of iterations with no improvement to wait before early stopping.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. 
Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKPassiveAggressiveClassifier(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams], - ContinueFitMixin[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn PassiveAggressiveClassifier - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.PASSIVE_AGGRESSIVE, ], - "name": "sklearn.linear_model.passive_aggressive.PassiveAggressiveClassifier", - "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION, - "python_path": "d3m.primitives.classification.passive_aggressive.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PassiveAggressiveClassifier.html']}, - "version": "2019.11.13", - "id": "85e5c88d-9eec-3452-8f2f-414f17d3e4d5", - "hyperparams_to_tune": ['C'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = PassiveAggressiveClassifier( - 
C=self.hyperparams['C'], - fit_intercept=self.hyperparams['fit_intercept'], - max_iter=self.hyperparams['max_iter'], - shuffle=self.hyperparams['shuffle'], - tol=self.hyperparams['tol'], - n_jobs=self.hyperparams['n_jobs'], - loss=self.hyperparams['loss'], - warm_start=self.hyperparams['warm_start'], - class_weight=self.hyperparams['class_weight'], - average=self.hyperparams['average'], - early_stopping=self.hyperparams['early_stopping'], - validation_fraction=self.hyperparams['validation_fraction'], - n_iter_no_change=self.hyperparams['n_iter_no_change'], - verbose=_verbose, - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - def continue_fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._training_inputs is None or self._training_outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.partial_fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, 
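# NOTE: produce() re-selects columns with the same logic used at fit time and,
# below, translates sklearn's NotFittedError into d3m's PrimitiveNotFittedError.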
self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - coef_=None, - intercept_=None, - classes_=None, - _expanded_class_weight=None, - alpha=None, - epsilon=None, - eta0=None, - l1_ratio=None, - learning_rate=None, - loss_function_=None, - n_iter_=None, - penalty=None, - power_t=None, - t_=None, - average_coef_=None, - average_intercept_=None, - standard_coef_=None, - standard_intercept_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - coef_=getattr(self._clf, 'coef_', None), - intercept_=getattr(self._clf, 'intercept_', None), - classes_=getattr(self._clf, 'classes_', None), - _expanded_class_weight=getattr(self._clf, '_expanded_class_weight', None), - alpha=getattr(self._clf, 'alpha', None), - epsilon=getattr(self._clf, 'epsilon', None), - eta0=getattr(self._clf, 'eta0', None), - l1_ratio=getattr(self._clf, 'l1_ratio', None), - learning_rate=getattr(self._clf, 'learning_rate', None), - loss_function_=getattr(self._clf, 'loss_function_', None), - n_iter_=getattr(self._clf, 'n_iter_', None), - penalty=getattr(self._clf, 'penalty', None), - power_t=getattr(self._clf, 'power_t', None), - t_=getattr(self._clf, 't_', None), - average_coef_=getattr(self._clf, 'average_coef_', None), - average_intercept_=getattr(self._clf, 'average_intercept_', None), - standard_coef_=getattr(self._clf, 'standard_coef_', None), - standard_intercept_=getattr(self._clf, 'standard_intercept_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.coef_ = params['coef_'] - self._clf.intercept_ = params['intercept_'] - self._clf.classes_ = params['classes_'] - self._clf._expanded_class_weight = params['_expanded_class_weight'] - self._clf.alpha = params['alpha'] - self._clf.epsilon = params['epsilon'] - self._clf.eta0 = params['eta0'] - self._clf.l1_ratio = params['l1_ratio'] - self._clf.learning_rate = params['learning_rate'] - self._clf.loss_function_ = params['loss_function_'] - self._clf.n_iter_ = params['n_iter_'] - self._clf.penalty = params['penalty'] - self._clf.power_t = params['power_t'] - self._clf.t_ = params['t_'] - 
self._clf.average_coef_ = params['average_coef_'] - self._clf.average_intercept_ = params['average_intercept_'] - self._clf.standard_coef_ = params['standard_coef_'] - self._clf.standard_intercept_ = params['standard_intercept_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['coef_'] is not None: - self._fitted = True - if params['intercept_'] is not None: - self._fitted = True - if params['classes_'] is not None: - self._fitted = True - if params['_expanded_class_weight'] is not None: - self._fitted = True - if params['alpha'] is not None: - self._fitted = True - if params['epsilon'] is not None: - self._fitted = True - if params['eta0'] is not None: - self._fitted = True - if params['l1_ratio'] is not None: - self._fitted = True - if params['learning_rate'] is not None: - self._fitted = True - if params['loss_function_'] is not None: - self._fitted = True - if params['n_iter_'] is not None: - self._fitted = True - if params['penalty'] is not None: - self._fitted = True - if params['power_t'] is not None: - self._fitted = True - if params['t_'] is not None: - self._fitted = True - if params['average_coef_'] is not None: - self._fitted = True - if params['average_intercept_'] is not None: - self._fitted = True - if params['standard_coef_'] is not None: - self._fitted = True - if params['standard_intercept_'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - 
accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = 
"output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKPassiveAggressiveClassifier.__doc__ = PassiveAggressiveClassifier.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKPassiveAggressiveRegressor.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKPassiveAggressiveRegressor.py deleted file mode 100644 index 900de99..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKPassiveAggressiveRegressor.py +++ /dev/null @@ -1,583 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.linear_model.passive_aggressive import PassiveAggressiveRegressor - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - coef_: Optional[ndarray] - intercept_: Optional[ndarray] - t_: Optional[float] - alpha: Optional[float] - eta0: Optional[float] - l1_ratio: Optional[int] - learning_rate: Optional[str] - n_iter_: Optional[int] - penalty: Optional[float] - power_t: Optional[float] - average_coef_: Optional[ndarray] - average_intercept_: Optional[ndarray] - standard_coef_: Optional[ndarray] - standard_intercept_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - C = hyperparams.Hyperparameter[float]( - default=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - fit_intercept = hyperparams.UniformBool( - default=True, - description='Whether the intercept should be estimated or not. If False, the data is assumed to be already centered. 
Defaults to True.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_iter = hyperparams.Hyperparameter[int]( - default=1000, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - shuffle = hyperparams.UniformBool( - default=True, - description='Whether or not the training data should be shuffled after each epoch.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.001, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='float', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - loss = hyperparams.Enumeration[str]( - values=['epsilon_insensitive', 'squared_epsilon_insensitive'], - default='epsilon_insensitive', - description='The loss function to be used: epsilon_insensitive: equivalent to PA-I in the reference paper. squared_epsilon_insensitive: equivalent to PA-II in the reference paper.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - warm_start = hyperparams.UniformBool( - default=False, - description='When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - average = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - default=2, - lower=2, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'bool': hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='bool', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - epsilon = hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.1, - description='If the difference between the current prediction and the correct label is below this threshold, the model is not updated.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - early_stopping = hyperparams.UniformBool( - default=False, - description='Whether to use early stopping to terminate training when validation score is not improving. If set to True, it will automatically set aside a fraction of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - validation_fraction = hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=1, - description='The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. 
Only used if early_stopping is True.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_iter_no_change = hyperparams.Bounded[int]( - default=5, - lower=0, - upper=None, - description='Number of iterations with no improvement to wait before early stopping.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKPassiveAggressiveRegressor(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams], - ContinueFitMixin[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn PassiveAggressiveRegressor - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PassiveAggressiveRegressor.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.PASSIVE_AGGRESSIVE, ], - "name": "sklearn.linear_model.passive_aggressive.PassiveAggressiveRegressor", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.passive_aggressive.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PassiveAggressiveRegressor.html']}, - "version": "2019.11.13", - "id": "50ce5919-a155-3c72-a230-f4ab4b5babba", - "hyperparams_to_tune": ['C'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = PassiveAggressiveRegressor( - C=self.hyperparams['C'], - fit_intercept=self.hyperparams['fit_intercept'], - max_iter=self.hyperparams['max_iter'], - shuffle=self.hyperparams['shuffle'], - tol=self.hyperparams['tol'], - loss=self.hyperparams['loss'], - warm_start=self.hyperparams['warm_start'], - average=self.hyperparams['average'], - epsilon=self.hyperparams['epsilon'], - early_stopping=self.hyperparams['early_stopping'], - validation_fraction=self.hyperparams['validation_fraction'], - n_iter_no_change=self.hyperparams['n_iter_no_change'], - random_state=self.random_seed, - verbose=_verbose - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = 
False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - def continue_fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._training_inputs is None or self._training_outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.partial_fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - coef_=None, - intercept_=None, - t_=None, - alpha=None, - eta0=None, - l1_ratio=None, - learning_rate=None, - n_iter_=None, - penalty=None, - power_t=None, - average_coef_=None, - average_intercept_=None, - standard_coef_=None, - standard_intercept_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - 
target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - coef_=getattr(self._clf, 'coef_', None), - intercept_=getattr(self._clf, 'intercept_', None), - t_=getattr(self._clf, 't_', None), - alpha=getattr(self._clf, 'alpha', None), - eta0=getattr(self._clf, 'eta0', None), - l1_ratio=getattr(self._clf, 'l1_ratio', None), - learning_rate=getattr(self._clf, 'learning_rate', None), - n_iter_=getattr(self._clf, 'n_iter_', None), - penalty=getattr(self._clf, 'penalty', None), - power_t=getattr(self._clf, 'power_t', None), - average_coef_=getattr(self._clf, 'average_coef_', None), - average_intercept_=getattr(self._clf, 'average_intercept_', None), - standard_coef_=getattr(self._clf, 'standard_coef_', None), - standard_intercept_=getattr(self._clf, 'standard_intercept_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.coef_ = params['coef_'] - self._clf.intercept_ = params['intercept_'] - self._clf.t_ = params['t_'] - self._clf.alpha = params['alpha'] - self._clf.eta0 = params['eta0'] - self._clf.l1_ratio = params['l1_ratio'] - self._clf.learning_rate = params['learning_rate'] - self._clf.n_iter_ = params['n_iter_'] - self._clf.penalty = params['penalty'] - self._clf.power_t = params['power_t'] - self._clf.average_coef_ = params['average_coef_'] - self._clf.average_intercept_ = params['average_intercept_'] - self._clf.standard_coef_ = params['standard_coef_'] - self._clf.standard_intercept_ = params['standard_intercept_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['coef_'] is not None: - self._fitted = True - if params['intercept_'] is not None: - self._fitted = True - if params['t_'] is not None: - self._fitted = True - if params['alpha'] is not None: - self._fitted = True - if params['eta0'] is not None: - self._fitted = True - if params['l1_ratio'] is not None: - self._fitted = True - if params['learning_rate'] is not None: - self._fitted = True - if params['n_iter_'] is not None: - self._fitted = True - if params['penalty'] is not None: - self._fitted = True - if params['power_t'] is not None: - self._fitted = True - if params['average_coef_'] is not None: - self._fitted = True - if params['average_intercept_'] is not None: - self._fitted = True - if params['standard_coef_'] is not None: - self._fitted = True - if params['standard_intercept_'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - 
can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
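-            # For example (a hypothetical illustration of the set arithmetic below): a target - # column typed {TrueTarget, Target} comes out as {Target, PredictedTarget} together - # with the configured return_semantic_type.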
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKPassiveAggressiveRegressor.__doc__ = PassiveAggressiveRegressor.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKPolynomialFeatures.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKPolynomialFeatures.py deleted file mode 100644 index 283adfd..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKPolynomialFeatures.py +++ /dev/null @@ -1,346 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.preprocessing.data import PolynomialFeatures - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - n_input_features_: 
Optional[int] - n_output_features_: Optional[int] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - degree = hyperparams.Hyperparameter[int]( - default=2, - description='The degree of the polynomial features. Default = 2.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - include_bias = hyperparams.UniformBool( - default=True, - description='If True (default), then include a bias column, the feature in which all polynomial powers are zero (i.e. a column of ones - acts as an intercept term in a linear model). Examples -------- >>> X = np.arange(6).reshape(3, 2) >>> X array([[0, 1], [2, 3], [4, 5]]) >>> poly = PolynomialFeatures(2) >>> poly.fit_transform(X) array([[ 1., 0., 1., 0., 0., 1.], [ 1., 2., 3., 4., 6., 9.], [ 1., 4., 5., 16., 20., 25.]]) >>> poly = PolynomialFeatures(interaction_only=True) >>> poly.fit_transform(X) array([[ 1., 0., 1., 0.], [ 1., 2., 3., 6.], [ 1., 4., 5., 20.]])', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKPolynomialFeatures(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn PolynomialFeatures - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.STATISTICAL_MOMENT_ANALYSIS, ], - "name": "sklearn.preprocessing.data.PolynomialFeatures", - "primitive_family": metadata_base.PrimitiveFamily.DATA_PREPROCESSING, - "python_path": "d3m.primitives.data_preprocessing.polynomial_features.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html']}, - "version": "2019.11.13", - "id": "93acb44b-532a-37d3-987a-8e61a8489d77", - "hyperparams_to_tune": ['degree'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = PolynomialFeatures( - degree=self.hyperparams['degree'], - include_bias=self.hyperparams['include_bias'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = 
self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - outputs = self._wrap_predictions(inputs, sk_output) - if len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - n_input_features_=None, - n_output_features_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - n_input_features_=getattr(self._clf, 'n_input_features_', None), - n_output_features_=getattr(self._clf, 'n_output_features_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.n_input_features_ = params['n_input_features_'] - self._clf.n_output_features_ = params['n_output_features_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['n_input_features_'] is not None: - self._fitted = True - if params['n_output_features_'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in 
semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare them for generated output columns. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._copy_inputs_metadata(inputs.metadata, self._training_indices, outputs.metadata, self.hyperparams) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _copy_inputs_metadata(cls, inputs_metadata: metadata_base.DataMetadata, input_indices: List[int], - outputs_metadata: metadata_base.DataMetadata, hyperparams): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - target_columns_metadata: List[OrderedDict] = [] - for column_index in input_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - # If outputs has more columns than input_indices, attach the configured return semantic type to all remaining columns - if outputs_length > len(input_indices): - for column_index in range(len(input_indices), outputs_length): - column_metadata = OrderedDict() - semantic_types = set() - semantic_types.add(hyperparams["return_semantic_type"]) - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = list(semantic_types) - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - 
return target_columns_metadata - - -SKPolynomialFeatures.__doc__ = PolynomialFeatures.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKQuadraticDiscriminantAnalysis.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKQuadraticDiscriminantAnalysis.py deleted file mode 100644 index fa90760..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKQuadraticDiscriminantAnalysis.py +++ /dev/null @@ -1,473 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - covariance_: Optional[ndarray] - means_: Optional[ndarray] - priors_: Optional[ndarray] - rotations_: Optional[Sequence[ndarray]] - scalings_: Optional[Sequence[ndarray]] - classes_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - reg_param = hyperparams.Bounded[float]( - default=0.0, - lower=0, - upper=1, - description='Regularizes the covariance estimate as ``(1-reg_param)*Sigma + reg_param*np.eye(n_features)``', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Bounded[float]( - default=0.0001, - lower=0, - upper=None, - description='Threshold used for rank estimation. .. versionadded:: 0.17', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. 
Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKQuadraticDiscriminantAnalysis(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams], - ProbabilisticCompositionalityMixin[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn QuadraticDiscriminantAnalysis - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.QUADRATIC_DISCRIMINANT_ANALYSIS, ], - "name": "sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis", - "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION, - "python_path": "d3m.primitives.classification.quadratic_discriminant_analysis.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html']}, - "version": "2019.11.13", - "id": "321dbf4d-07d9-3274-bd1b-2751520ed1d7", - "hyperparams_to_tune": ['reg_param'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - 
git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = QuadraticDiscriminantAnalysis( - reg_param=self.hyperparams['reg_param'], - tol=self.hyperparams['tol'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - covariance_=None, - means_=None, - priors_=None, - 
rotations_=None, - scalings_=None, - classes_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - covariance_=getattr(self._clf, 'covariance_', None), - means_=getattr(self._clf, 'means_', None), - priors_=getattr(self._clf, 'priors_', None), - rotations_=getattr(self._clf, 'rotations_', None), - scalings_=getattr(self._clf, 'scalings_', None), - classes_=getattr(self._clf, 'classes_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.covariance_ = params['covariance_'] - self._clf.means_ = params['means_'] - self._clf.priors_ = params['priors_'] - self._clf.rotations_ = params['rotations_'] - self._clf.scalings_ = params['scalings_'] - self._clf.classes_ = params['classes_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['covariance_'] is not None: - self._fitted = True - if params['means_'] is not None: - self._fitted = True - if params['priors_'] is not None: - self._fitted = True - if params['rotations_'] is not None: - self._fitted = True - if params['scalings_'] is not None: - self._fitted = True - if params['classes_'] is not None: - self._fitted = True - - - def log_likelihoods(self, *, - outputs: Outputs, - inputs: Inputs, - timeout: float = None, - iterations: int = None) -> CallResult[Sequence[float]]: - inputs = inputs.iloc[:, self._training_indices] # Get ndarray - outputs = outputs.iloc[:, self._target_column_indices] - - if len(inputs.columns) and len(outputs.columns): - - if outputs.shape[1] != self._clf.n_outputs_: - raise exceptions.InvalidArgumentValueError("\"outputs\" argument does not have the correct number of target columns.") - - log_proba = self._clf.predict_log_proba(inputs) - - # Making it always a list, even when only one target. - if self._clf.n_outputs_ == 1: - log_proba = [log_proba] - classes = [self._clf.classes_] - else: - classes = self._clf.classes_ - - samples_length = inputs.shape[0] - - log_likelihoods = [] - for k in range(self._clf.n_outputs_): - # We have to map each class to its internal (numerical) index used in the learner. - # This allows "outputs" to contain string classes. - outputs_column = outputs.iloc[:, k] - classes_map = pandas.Series(numpy.arange(len(classes[k])), index=classes[k]) - mapped_outputs_column = outputs_column.map(classes_map) - - # For each target column (column in "outputs"), for each sample (row) we pick the log - # likelihood for a given class. 
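The class-to-index mapping set up in the comment above (and used by the `log_likelihoods.append(...)` call that follows) is the heart of `log_likelihoods`: string labels in `outputs` are translated to the learner's internal column positions so that, per row, the matching entry of `predict_log_proba` can be picked out. A self-contained sketch of just that indexing trick, on synthetic data with hypothetical names:

```python
import numpy
import pandas

# Log-probabilities for 4 samples over 3 classes, shaped like one entry
# of predict_log_proba's output (rows: samples, columns: classes).
log_proba = numpy.log(numpy.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.5, 0.25, 0.25],
]))

classes = numpy.array(['cat', 'dog', 'fish'])            # like clf.classes_
observed = pandas.Series(['cat', 'dog', 'fish', 'dog'])  # ground-truth labels

# Map each string label to its internal (numerical) class index.
classes_map = pandas.Series(numpy.arange(len(classes)), index=classes)
mapped = observed.map(classes_map).to_numpy()

# For each sample (row), pick the log-likelihood of its observed class.
print(log_proba[numpy.arange(len(observed)), mapped])
# -> [log 0.7, log 0.8, log 0.4, log 0.25]
```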
- log_likelihoods.append(log_proba[k][numpy.arange(samples_length), mapped_outputs_column]) - - results = d3m_dataframe(dict(enumerate(log_likelihoods)), generate_metadata=True) - results.columns = outputs.columns - - for k in range(self._clf.n_outputs_): - column_metadata = outputs.metadata.query_column(k) - if 'name' in column_metadata: - results.metadata = results.metadata.update_column(k, {'name': column_metadata['name']}) - - else: - results = d3m_dataframe(generate_metadata=True) - - return CallResult(results) - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, 
hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKQuadraticDiscriminantAnalysis.__doc__ = QuadraticDiscriminantAnalysis.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKQuantileTransformer.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKQuantileTransformer.py deleted file mode 100644 index e077dd2..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKQuantileTransformer.py +++ /dev/null @@ -1,364 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.preprocessing.data import QuantileTransformer - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import 
utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - quantiles_: Optional[ndarray] - references_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - n_quantiles = hyperparams.UniformInt( - default=1000, - lower=100, - upper=10000, - description='Number of quantiles to be computed. It corresponds to the number of landmarks used to discretize the cumulative distribution function.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - output_distribution = hyperparams.Enumeration[str]( - default='uniform', - values=['uniform', 'normal'], - description='Marginal distribution for the transformed data. The choices are \'uniform\' (default) or \'normal\'.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - ignore_implicit_zeros = hyperparams.UniformBool( - default=False, - description='Only applies to sparse matrices. If True, the sparse entries of the matrix are discarded to compute the quantile statistics. If False, these entries are treated as zeros.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - subsample = hyperparams.Bounded[float]( - default=100000.0, - lower=1000.0, - upper=100000.0, - description='Maximum number of samples used to estimate the quantiles for computational efficiency. Note that the subsampling procedure may differ for value-identical sparse and dense matrices.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. 
Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKQuantileTransformer(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn QuantileTransformer - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.DATA_CONVERSION, ], - "name": "sklearn.preprocessing.data.QuantileTransformer", - "primitive_family": metadata_base.PrimitiveFamily.DATA_PREPROCESSING, - "python_path": "d3m.primitives.data_preprocessing.quantile_transformer.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html']}, - "version": "2019.11.13", - "id": "54c5e71f-0909-400b-ae65-b33631e7648f", - "hyperparams_to_tune": ['n_quantiles', 'output_distribution'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = QuantileTransformer( - n_quantiles=self.hyperparams['n_quantiles'], - output_distribution=self.hyperparams['output_distribution'], - ignore_implicit_zeros=self.hyperparams['ignore_implicit_zeros'], - subsample=self.hyperparams['subsample'], - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices =
self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - outputs = self._wrap_predictions(inputs, sk_output) - if len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - quantiles_=None, - references_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - quantiles_=getattr(self._clf, 'quantiles_', None), - references_=getattr(self._clf, 'references_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.quantiles_ = params['quantiles_'] - self._clf.references_ = params['references_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['quantiles_'] is not None: - self._fitted = True - if params['references_'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod 
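Before the column-selection helpers below, it may help to see what the `fit`/`produce` pair above reduces to when `use_semantic_types` is false and every column is numeric: construct the estimator from the hyper-parameters, fit on the training frame, transform on produce. A rough sketch, assuming a recent scikit-learn where `QuantileTransformer` is imported from `sklearn.preprocessing` (the module path pinned above, `sklearn.preprocessing.data`, belongs to the older release this wrapper targets):

```python
import numpy
import pandas
from sklearn.preprocessing import QuantileTransformer

rng = numpy.random.RandomState(0)
df = pandas.DataFrame({'a': rng.exponential(size=200),
                       'b': rng.normal(size=200)})

# Hyper-parameters forwarded at construction, random_state taken from
# the primitive's random_seed, mirroring __init__ above.
clf = QuantileTransformer(n_quantiles=100, output_distribution='uniform',
                          ignore_implicit_zeros=False, subsample=100000,
                          random_state=0)

clf.fit(df)              # fit(): learn per-column quantile landmarks
out = clf.transform(df)  # produce(): map each value through the empirical CDF
print(out.min(axis=0), out.max(axis=0))  # columns now span roughly [0, 1]
```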
- def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._copy_inputs_metadata(inputs.metadata, self._training_indices, outputs.metadata, self.hyperparams) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _copy_inputs_metadata(cls, inputs_metadata: metadata_base.DataMetadata, input_indices: List[int], - outputs_metadata: metadata_base.DataMetadata, hyperparams): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - target_columns_metadata: List[OrderedDict] = [] - for column_index in input_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types =
semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - # If outputs has more columns than index, add Attribute Type to all remaining - if outputs_length > len(input_indices): - for column_index in range(len(input_indices), outputs_length): - column_metadata = OrderedDict() - semantic_types = set() - semantic_types.add(hyperparams["return_semantic_type"]) - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = list(semantic_types) - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKQuantileTransformer.__doc__ = QuantileTransformer.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKRBFSampler.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKRBFSampler.py deleted file mode 100644 index 03cd11c..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKRBFSampler.py +++ /dev/null @@ -1,349 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.kernel_approximation import RBFSampler - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - random_weights_: Optional[ndarray] - random_offset_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - gamma = hyperparams.Hyperparameter[float]( - default=1, - description='Parameter of RBF kernel: exp(-gamma * x^2)', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_components = hyperparams.Bounded[int]( - lower=0, - upper=None, - default=100, - description='Number of Monte Carlo samples per original feature. Equals the dimensionality of the computed feature space.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. 
Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKRBFSampler(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn RBFSampler - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.KERNEL_METHOD, ], - "name": "sklearn.kernel_approximation.RBFSampler", - "primitive_family": metadata_base.PrimitiveFamily.DATA_PREPROCESSING, - "python_path": "d3m.primitives.data_preprocessing.rbf_sampler.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.kernel_approximation.RBFSampler.html']}, - "version": "2019.11.13", - "id": "0823123d-45a3-3dc8-9ef1-ff643236993a", - "hyperparams_to_tune": ['gamma', 'n_components'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = RBFSampler( - gamma=self.hyperparams['gamma'], - n_components=self.hyperparams['n_components'], - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None 
- self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - outputs = self._wrap_predictions(inputs, sk_output) - if len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - random_weights_=None, - random_offset_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - random_weights_=getattr(self._clf, 'random_weights_', None), - random_offset_=getattr(self._clf, 'random_offset_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.random_weights_ = params['random_weights_'] - self._clf.random_offset_ = params['random_offset_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['random_weights_'] is not None: - self._fitted = True - if params['random_offset_'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - 
return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
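The set arithmetic in `_can_produce_column` above is easy to misread: a column is accepted when `accepted_semantic_types - semantic_types` is empty, that is, when every required type is present on the column, not merely any of them. A small illustration:

```python
accepted = {"https://metadata.datadrivendiscovery.org/types/Attribute"}

col_a = {"https://metadata.datadrivendiscovery.org/types/Attribute",
         "http://schema.org/Float"}
col_b = {"http://schema.org/Float"}

print(len(accepted - col_a) == 0)  # True: all required types present
print(len(accepted - col_b) == 0)  # False: the column is skipped
```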
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._copy_inputs_metadata(inputs.metadata, self._training_indices, outputs.metadata, self.hyperparams) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _copy_inputs_metadata(cls, inputs_metadata: metadata_base.DataMetadata, input_indices: List[int], - outputs_metadata: metadata_base.DataMetadata, hyperparams): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - target_columns_metadata: List[OrderedDict] = [] - for column_index in input_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - # If outputs has more columns than index, add Attribute Type to all remaining - if outputs_length > len(input_indices): - for column_index in range(len(input_indices), outputs_length): - column_metadata = OrderedDict() - semantic_types = set() - semantic_types.add(hyperparams["return_semantic_type"]) - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = list(semantic_types) - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKRBFSampler.__doc__ = RBFSampler.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKRandomForestClassifier.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKRandomForestClassifier.py deleted file mode 100644 index ddef232..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKRandomForestClassifier.py +++ /dev/null @@ -1,682 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy
import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.ensemble.forest import RandomForestClassifier - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - estimators_: Optional[List[sklearn.tree.DecisionTreeClassifier]] - classes_: Optional[Union[ndarray, List[ndarray]]] - n_classes_: Optional[Union[int, List[int]]] - n_features_: Optional[int] - n_outputs_: Optional[int] - oob_score_: Optional[float] - oob_decision_function_: Optional[ndarray] - base_estimator_: Optional[object] - estimator_params: Optional[tuple] - base_estimator: Optional[object] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - n_estimators = hyperparams.Bounded[int]( - default=10, - lower=1, - upper=None, - description='The number of trees in the forest.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - criterion = hyperparams.Enumeration[str]( - values=['gini', 'entropy'], - default='gini', - description='The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. Note: this parameter is tree-specific.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_features = hyperparams.Union( - configuration=OrderedDict({ - 'specified_int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'calculated': hyperparams.Enumeration[str]( - values=['auto', 'sqrt', 'log2'], - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Uniform( - default=0.25, - lower=0, - upper=1, - lower_inclusive=True, - upper_inclusive=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='calculated', - description='The number of features to consider when looking for the best split: - If int, then consider `max_features` features at each split. - If float, then `max_features` is a percentage and `int(max_features * n_features)` features are considered at each split. - If "auto", then `max_features=sqrt(n_features)`. - If "sqrt", then `max_features=sqrt(n_features)` (same as "auto"). - If "log2", then `max_features=log2(n_features)`. - If None, then `max_features=n_features`. 
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than ``max_features`` features.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_depth = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - default=10, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_samples_split = hyperparams.Union( - configuration=OrderedDict({ - 'absolute': hyperparams.Bounded[int]( - default=2, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Uniform( - default=0.25, - lower=0, - upper=1, - lower_inclusive=False, - upper_inclusive=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='absolute', - description='The minimum number of samples required to split an internal node: - If int, then consider `min_samples_split` as the minimum number. - If float, then `min_samples_split` is a percentage and `ceil(min_samples_split * n_samples)` are the minimum number of samples for each split. .. versionchanged:: 0.18 Added float values for percentages.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_samples_leaf = hyperparams.Union( - configuration=OrderedDict({ - 'absolute': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Uniform( - default=0.25, - lower=0, - upper=0.5, - lower_inclusive=False, - upper_inclusive=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='absolute', - description='The minimum number of samples required to be at a leaf node: - If int, then consider `min_samples_leaf` as the minimum number. - If float, then `min_samples_leaf` is a percentage and `ceil(min_samples_leaf * n_samples)` are the minimum number of samples for each node. .. versionchanged:: 0.18 Added float values for percentages.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_weight_fraction_leaf = hyperparams.Uniform( - default=0, - lower=0, - upper=0.5, - upper_inclusive=True, - description='The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. 
Samples have equal weight when sample_weight is not provided.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_leaf_nodes = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - default=10, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Grow trees with ``max_leaf_nodes`` in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_impurity_decrease = hyperparams.Bounded[float]( - default=0.0, - lower=0.0, - upper=None, - description='A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following:: N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity) where ``N`` is the total number of samples, ``N_t`` is the number of samples at the current node, ``N_t_L`` is the number of samples in the left child, and ``N_t_R`` is the number of samples in the right child. ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum, if ``sample_weight`` is passed. .. versionadded:: 0.19 ', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - bootstrap = hyperparams.Enumeration[str]( - values=['bootstrap', 'bootstrap_with_oob_score', 'disabled'], - default='bootstrap', - description='Whether bootstrap samples are used when building trees.' - ' And whether to use out-of-bag samples to estimate the generalization accuracy.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - n_jobs = hyperparams.Union( - configuration=OrderedDict({ - 'limit': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'all_cores': hyperparams.Constant( - default=-1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='limit', - description='The number of jobs to run in parallel for both `fit` and `predict`. If -1, then the number of jobs is set to the number of cores.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - warm_start = hyperparams.UniformBool( - default=False, - description='When set to ``True``, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - class_weight = hyperparams.Union( - configuration=OrderedDict({ - 'str': hyperparams.Enumeration[str]( - default='balanced', - values=['balanced', 'balanced_subsample'], - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='"balanced_subsample" or None, optional (default=None) Weights associated with classes in the form ``{class_label: weight}``. If not given, all classes are supposed to have weight one. 
For multi-output problems, a list of dicts can be provided in the same order as the columns of y. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))`` The "balanced_subsample" mode is the same as "balanced" except that weights are computed based on the bootstrap sample for every tree grown. For multi-output, the weights of each column of y will be multiplied. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKRandomForestClassifier(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams], - ProbabilisticCompositionalityMixin[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn RandomForestClassifier - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.RANDOM_FOREST, ], - "name": "sklearn.ensemble.forest.RandomForestClassifier", - "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION, - "python_path": "d3m.primitives.classification.random_forest.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html']}, - "version": "2019.11.13", - "id": "1dd82833-5692-39cb-84fb-2455683075f3", - "hyperparams_to_tune": ['n_estimators', 'max_depth', 'min_samples_split', 'min_samples_leaf', 'max_features'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = RandomForestClassifier( - n_estimators=self.hyperparams['n_estimators'], - criterion=self.hyperparams['criterion'], - max_features=self.hyperparams['max_features'], - max_depth=self.hyperparams['max_depth'], - min_samples_split=self.hyperparams['min_samples_split'], - min_samples_leaf=self.hyperparams['min_samples_leaf'], - min_weight_fraction_leaf=self.hyperparams['min_weight_fraction_leaf'], - max_leaf_nodes=self.hyperparams['max_leaf_nodes'], - min_impurity_decrease=self.hyperparams['min_impurity_decrease'], - bootstrap=self.hyperparams['bootstrap'] in ['bootstrap', 'bootstrap_with_oob_score'], - oob_score=self.hyperparams['bootstrap'] in ['bootstrap_with_oob_score'], - n_jobs=self.hyperparams['n_jobs'], - warm_start=self.hyperparams['warm_start'], - class_weight=self.hyperparams['class_weight'], - random_state=self.random_seed, - verbose=_verbose - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs -
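Note how the single `bootstrap` enumeration above fans out into two scikit-learn arguments: `bootstrap` is true for both `'bootstrap'` and `'bootstrap_with_oob_score'`, while `oob_score` is true only for the latter, which rules out the invalid `bootstrap=False, oob_score=True` combination by construction. A minimal sketch of that mapping, assuming the current `sklearn.ensemble` import path (the helper name is hypothetical):

```python
from sklearn.ensemble import RandomForestClassifier

def make_forest(bootstrap_hp: str, random_seed: int = 0) -> RandomForestClassifier:
    # 'disabled'                 -> bootstrap=False, oob_score=False
    # 'bootstrap'                -> bootstrap=True,  oob_score=False
    # 'bootstrap_with_oob_score' -> bootstrap=True,  oob_score=True
    return RandomForestClassifier(
        n_estimators=10,
        bootstrap=bootstrap_hp in ['bootstrap', 'bootstrap_with_oob_score'],
        oob_score=bootstrap_hp in ['bootstrap_with_oob_score'],
        random_state=random_seed,
    )

clf = make_forest('bootstrap_with_oob_score')
print(clf.bootstrap, clf.oob_score)  # True True
```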
self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - estimators_=None, - classes_=None, - n_classes_=None, - n_features_=None, - n_outputs_=None, - oob_score_=None, - oob_decision_function_=None, - base_estimator_=None, - estimator_params=None, - base_estimator=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - estimators_=getattr(self._clf, 'estimators_', None), - classes_=getattr(self._clf, 'classes_', None), - n_classes_=getattr(self._clf, 'n_classes_', None), - n_features_=getattr(self._clf, 'n_features_', None), - n_outputs_=getattr(self._clf, 'n_outputs_', None), - oob_score_=getattr(self._clf, 'oob_score_', None), - oob_decision_function_=getattr(self._clf, 'oob_decision_function_', None), - base_estimator_=getattr(self._clf, 'base_estimator_', None), - estimator_params=getattr(self._clf, 'estimator_params', None), - 
base_estimator=getattr(self._clf, 'base_estimator', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.estimators_ = params['estimators_'] - self._clf.classes_ = params['classes_'] - self._clf.n_classes_ = params['n_classes_'] - self._clf.n_features_ = params['n_features_'] - self._clf.n_outputs_ = params['n_outputs_'] - self._clf.oob_score_ = params['oob_score_'] - self._clf.oob_decision_function_ = params['oob_decision_function_'] - self._clf.base_estimator_ = params['base_estimator_'] - self._clf.estimator_params = params['estimator_params'] - self._clf.base_estimator = params['base_estimator'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['estimators_'] is not None: - self._fitted = True - if params['classes_'] is not None: - self._fitted = True - if params['n_classes_'] is not None: - self._fitted = True - if params['n_features_'] is not None: - self._fitted = True - if params['n_outputs_'] is not None: - self._fitted = True - if params['oob_score_'] is not None: - self._fitted = True - if params['oob_decision_function_'] is not None: - self._fitted = True - if params['base_estimator_'] is not None: - self._fitted = True - if params['estimator_params'] is not None: - self._fitted = True - if params['base_estimator'] is not None: - self._fitted = True - - - def log_likelihoods(self, *, - outputs: Outputs, - inputs: Inputs, - timeout: float = None, - iterations: int = None) -> CallResult[Sequence[float]]: - inputs = inputs.iloc[:, self._training_indices] # Get ndarray - outputs = outputs.iloc[:, self._target_column_indices] - - if len(inputs.columns) and len(outputs.columns): - - if outputs.shape[1] != self._clf.n_outputs_: - raise exceptions.InvalidArgumentValueError("\"outputs\" argument does not have the correct number of target columns.") - - log_proba = self._clf.predict_log_proba(inputs) - - # Making it always a list, even when only one target. - if self._clf.n_outputs_ == 1: - log_proba = [log_proba] - classes = [self._clf.classes_] - else: - classes = self._clf.classes_ - - samples_length = inputs.shape[0] - - log_likelihoods = [] - for k in range(self._clf.n_outputs_): - # We have to map each class to its internal (numerical) index used in the learner. - # This allows "outputs" to contain string classes. - outputs_column = outputs.iloc[:, k] - classes_map = pandas.Series(numpy.arange(len(classes[k])), index=classes[k]) - mapped_outputs_column = outputs_column.map(classes_map) - - # For each target column (column in "outputs"), for each sample (row) we pick the log - # likelihood for a given class. 
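- # log_proba[k] has shape (n_samples, n_classes); indexing it with (numpy.arange(samples_length), mapped_outputs_column) picks, for every row, the log-probability of that row's true class.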
- log_likelihoods.append(log_proba[k][numpy.arange(samples_length), mapped_outputs_column]) - - results = d3m_dataframe(dict(enumerate(log_likelihoods)), generate_metadata=True) - results.columns = outputs.columns - - for k in range(self._clf.n_outputs_): - column_metadata = outputs.metadata.query_column(k) - if 'name' in column_metadata: - results.metadata = results.metadata.update_column(k, {'name': column_metadata['name']}) - - else: - results = d3m_dataframe(generate_metadata=True) - - return CallResult(results) - - - - def produce_feature_importances(self, *, timeout: float = None, iterations: int = None) -> CallResult[d3m_dataframe]: - output = d3m_dataframe(self._clf.feature_importances_.reshape((1, len(self._input_column_names)))) - output.columns = self._input_column_names - for i in range(len(self._input_column_names)): - output.metadata = output.metadata.update_column(i, {"name": self._input_column_names[i]}) - return CallResult(output) - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 
'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKRandomForestClassifier.__doc__ = RandomForestClassifier.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKRandomForestRegressor.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKRandomForestRegressor.py deleted file mode 100644 index 181105a..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKRandomForestRegressor.py +++ /dev/null @@ -1,609 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from 
numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.ensemble.forest import RandomForestRegressor - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - estimators_: Optional[List[sklearn.tree.DecisionTreeRegressor]] - n_features_: Optional[int] - n_outputs_: Optional[int] - oob_score_: Optional[float] - oob_prediction_: Optional[ndarray] - base_estimator_: Optional[object] - estimator_params: Optional[tuple] - base_estimator: Optional[object] - class_weight: Optional[Union[str, dict, List[dict]]] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - n_estimators = hyperparams.Bounded[int]( - default=10, - lower=1, - upper=None, - description='The number of trees in the forest.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - criterion = hyperparams.Enumeration[str]( - values=['mse', 'mae'], - default='mse', - description='The function to measure the quality of a split. Supported criteria are "mse" for the mean squared error, which is equal to variance reduction as feature selection criterion, and "mae" for the mean absolute error. .. versionadded:: 0.18 Mean Absolute Error (MAE) criterion.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_features = hyperparams.Union( - configuration=OrderedDict({ - 'specified_int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'calculated': hyperparams.Enumeration[str]( - values=['auto', 'sqrt', 'log2'], - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Uniform( - default=0.25, - lower=0, - upper=1, - lower_inclusive=True, - upper_inclusive=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='calculated', - description='The number of features to consider when looking for the best split: - If int, then consider `max_features` features at each split. - If float, then `max_features` is a percentage and `int(max_features * n_features)` features are considered at each split. - If "auto", then `max_features=n_features`. - If "sqrt", then `max_features=sqrt(n_features)`. - If "log2", then `max_features=log2(n_features)`. - If None, then `max_features=n_features`. 
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than ``max_features`` features.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_depth = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - default=10, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_samples_split = hyperparams.Union( - configuration=OrderedDict({ - 'absolute': hyperparams.Bounded[int]( - default=2, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Uniform( - default=0.25, - lower=0, - upper=1, - lower_inclusive=False, - upper_inclusive=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='absolute', - description='The minimum number of samples required to split an internal node: - If int, then consider `min_samples_split` as the minimum number. - If float, then `min_samples_split` is a percentage and `ceil(min_samples_split * n_samples)` are the minimum number of samples for each split. .. versionchanged:: 0.18 Added float values for percentages.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_samples_leaf = hyperparams.Union( - configuration=OrderedDict({ - 'absolute': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'percent': hyperparams.Uniform( - default=0.25, - lower=0, - upper=0.5, - lower_inclusive=False, - upper_inclusive=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='absolute', - description='The minimum number of samples required to be at a leaf node: - If int, then consider `min_samples_leaf` as the minimum number. - If float, then `min_samples_leaf` is a percentage and `ceil(min_samples_leaf * n_samples)` are the minimum number of samples for each node. .. versionchanged:: 0.18 Added float values for percentages.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_weight_fraction_leaf = hyperparams.Uniform( - default=0, - lower=0, - upper=0.5, - upper_inclusive=True, - description='The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. 
Samples have equal weight when sample_weight is not provided.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_leaf_nodes = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - default=10, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Grow trees with ``max_leaf_nodes`` in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_impurity_decrease = hyperparams.Bounded[float]( - default=0.0, - lower=0.0, - upper=None, - description='A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following:: N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity) where ``N`` is the total number of samples, ``N_t`` is the number of samples at the current node, ``N_t_L`` is the number of samples in the left child, and ``N_t_R`` is the number of samples in the right child. ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum, if ``sample_weight`` is passed. .. versionadded:: 0.19 ', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - bootstrap = hyperparams.Enumeration[str]( - values=['bootstrap', 'bootstrap_with_oob_score', 'disabled'], - default='bootstrap', - description='Whether bootstrap samples are used when building trees.' - ' And whether to use out-of-bag samples to estimate the generalization accuracy.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - n_jobs = hyperparams.Union( - configuration=OrderedDict({ - 'limit': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'all_cores': hyperparams.Constant( - default=-1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='limit', - description='The number of jobs to run in parallel for both `fit` and `predict`. If -1, then the number of jobs is set to the number of cores.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - warm_start = hyperparams.UniformBool( - default=False, - description='When set to ``True``, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. 
If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking, set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKRandomForestRegressor(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn RandomForestRegressor - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.RANDOM_FOREST, ], - "name": "sklearn.ensemble.forest.RandomForestRegressor", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.random_forest.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html']}, - "version": "2019.11.13", - "id": "f0fd7a62-09b5-3abc-93bb-f5f999f7cc80", - "hyperparams_to_tune": ['n_estimators', 'max_depth', 'min_samples_split', 'min_samples_leaf', 'max_features'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # Construct the wrapped sklearn estimator from the primitive's hyper-parameters. - self._clf = RandomForestRegressor( - n_estimators=self.hyperparams['n_estimators'], - criterion=self.hyperparams['criterion'], - max_features=self.hyperparams['max_features'], - max_depth=self.hyperparams['max_depth'], - min_samples_split=self.hyperparams['min_samples_split'], - min_samples_leaf=self.hyperparams['min_samples_leaf'], - min_weight_fraction_leaf=self.hyperparams['min_weight_fraction_leaf'], - max_leaf_nodes=self.hyperparams['max_leaf_nodes'], - min_impurity_decrease=self.hyperparams['min_impurity_decrease'], - bootstrap=self.hyperparams['bootstrap'] in ['bootstrap', 'bootstrap_with_oob_score'], - oob_score=self.hyperparams['bootstrap'] in ['bootstrap_with_oob_score'], - n_jobs=self.hyperparams['n_jobs'], - warm_start=self.hyperparams['warm_start'], - random_state=self.random_seed, - verbose=_verbose - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - 
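# Select the training columns and targets, ravel a single-column target to 1-D, and fit the wrapped estimator. -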
if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - estimators_=None, - n_features_=None, - n_outputs_=None, - oob_score_=None, - oob_prediction_=None, - base_estimator_=None, - estimator_params=None, - base_estimator=None, - class_weight=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - estimators_=getattr(self._clf, 'estimators_', None), - n_features_=getattr(self._clf, 'n_features_', None), - n_outputs_=getattr(self._clf, 'n_outputs_', None), - oob_score_=getattr(self._clf, 'oob_score_', None), - oob_prediction_=getattr(self._clf, 'oob_prediction_', None), - base_estimator_=getattr(self._clf, 'base_estimator_', None), - estimator_params=getattr(self._clf, 'estimator_params', None), - base_estimator=getattr(self._clf, 'base_estimator', None), - class_weight=getattr(self._clf, 'class_weight', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - 
target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.estimators_ = params['estimators_'] - self._clf.n_features_ = params['n_features_'] - self._clf.n_outputs_ = params['n_outputs_'] - self._clf.oob_score_ = params['oob_score_'] - self._clf.oob_prediction_ = params['oob_prediction_'] - self._clf.base_estimator_ = params['base_estimator_'] - self._clf.estimator_params = params['estimator_params'] - self._clf.base_estimator = params['base_estimator'] - self._clf.class_weight = params['class_weight'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['estimators_'] is not None: - self._fitted = True - if params['n_features_'] is not None: - self._fitted = True - if params['n_outputs_'] is not None: - self._fitted = True - if params['oob_score_'] is not None: - self._fitted = True - if params['oob_prediction_'] is not None: - self._fitted = True - if params['base_estimator_'] is not None: - self._fitted = True - if params['estimator_params'] is not None: - self._fitted = True - if params['base_estimator'] is not None: - self._fitted = True - if params['class_weight'] is not None: - self._fitted = True - - - - - - def produce_feature_importances(self, *, timeout: float = None, iterations: int = None) -> CallResult[d3m_dataframe]: - output = d3m_dataframe(self._clf.feature_importances_.reshape((1, len(self._input_column_names)))) - output.columns = self._input_column_names - for i in range(len(self._input_column_names)): - output.metadata = output.metadata.update_column(i, {"name": self._input_column_names[i]}) - return CallResult(output) - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: 
d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() 
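- # Build minimal metadata for each generated column: a PredictedTarget semantic type plus a column name, falling back to output_<index> when the input metadata has none.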
- semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKRandomForestRegressor.__doc__ = RandomForestRegressor.__doc__ diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKRandomTreesEmbedding.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKRandomTreesEmbedding.py deleted file mode 100644 index c4f7adf..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKRandomTreesEmbedding.py +++ /dev/null @@ -1,482 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.ensemble.forest import RandomTreesEmbedding - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - estimators_: Optional[Sequence[sklearn.base.BaseEstimator]] - one_hot_encoder_: Optional[object] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - n_estimators = hyperparams.Bounded[int]( - default=10, - lower=1, - upper=None, - description='Number of trees in the forest.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_depth = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=5, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='int', - description='The maximum depth of each tree. 
If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_samples_split = hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Bounded[float]( - lower=0, - upper=1, - default=1.0, - description='It\'s a percentage and `ceil(min_samples_split * n_samples)` is the minimum number of samples for each split.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=2, - description='Minimum number.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='int', - description='The minimum number of samples required to split an internal node: - If int, then consider `min_samples_split` as the minimum number. - If float, then `min_samples_split` is a percentage and `ceil(min_samples_split * n_samples)` is the minimum number of samples for each split. .. versionchanged:: 0.18 Added float values for percentages.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_samples_leaf = hyperparams.Union( - configuration=OrderedDict({ - 'percent': hyperparams.Bounded[float]( - lower=0, - upper=0.5, - default=0.25, - description='It\'s a percentage and `ceil(min_samples_leaf * n_samples)` is the minimum number of samples for each node.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'absolute': hyperparams.Bounded[int]( - lower=1, - upper=None, - default=1, - description='Minimum number.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='absolute', - description='The minimum number of samples required to be at a leaf node: - If int, then consider `min_samples_leaf` as the minimum number. - If float, then `min_samples_leaf` is a percentage and `ceil(min_samples_leaf * n_samples)` is the minimum number of samples for each node. .. versionchanged:: 0.18 Added float values for percentages.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_weight_fraction_leaf = hyperparams.Bounded[float]( - default=0, - lower=0, - upper=0.5, - description='The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_leaf_nodes = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=10, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Grow trees with ``max_leaf_nodes`` in best-first fashion. Best nodes are defined as relative reduction in impurity. 
If None then unlimited number of leaf nodes.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_impurity_split = hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Bounded[float]( - lower=0, - upper=None, - default=1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf. .. versionadded:: 0.18', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_impurity_decrease = hyperparams.Bounded[float]( - default=0, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_jobs = hyperparams.Union( - configuration=OrderedDict({ - 'limit': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'all_cores': hyperparams.Constant( - default=-1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='limit', - description='The number of jobs to run in parallel for both `fit` and `predict`. If -1, then the number of jobs is set to the number of cores.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - warm_start = hyperparams.UniformBool( - default=False, - description='When set to ``True``, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. 
Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking, set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKRandomTreesEmbedding(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn RandomTreesEmbedding - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomTreesEmbedding.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.RANDOM_FOREST, ], - "name": "sklearn.ensemble.forest.RandomTreesEmbedding", - "primitive_family": metadata_base.PrimitiveFamily.DATA_PREPROCESSING, - "python_path": "d3m.primitives.data_preprocessing.random_trees_embedding.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomTreesEmbedding.html']}, - "version": "2019.11.13", - "id": "8889ff47-1d2e-3a80-bdef-8397a95e1c6e", - "hyperparams_to_tune": ['n_estimators', 'max_depth', 'min_samples_split', 'min_samples_leaf'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # Construct the wrapped sklearn estimator from the primitive's hyper-parameters. - self._clf = RandomTreesEmbedding( - n_estimators=self.hyperparams['n_estimators'], - max_depth=self.hyperparams['max_depth'], - min_samples_split=self.hyperparams['min_samples_split'], - min_samples_leaf=self.hyperparams['min_samples_leaf'], - min_weight_fraction_leaf=self.hyperparams['min_weight_fraction_leaf'], - max_leaf_nodes=self.hyperparams['max_leaf_nodes'], - min_impurity_split=self.hyperparams['min_impurity_split'], - min_impurity_decrease=self.hyperparams['min_impurity_decrease'], - n_jobs=self.hyperparams['n_jobs'], - warm_start=self.hyperparams['warm_start'], - random_state=self.random_seed, - verbose=_verbose - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return 
CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - outputs = self._wrap_predictions(inputs, sk_output) - if len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - estimators_=None, - one_hot_encoder_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - estimators_=getattr(self._clf, 'estimators_', None), - one_hot_encoder_=getattr(self._clf, 'one_hot_encoder_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.estimators_ = params['estimators_'] - self._clf.one_hot_encoder_ = params['one_hot_encoder_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['estimators_'] is not None: - self._fitted = True - if params['one_hot_encoder_'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - 
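# get_columns_to_use applies the use/exclude column hyper-parameters on top of the can_produce_column predicate; only the surviving column indices are used for fitting. -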
return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._copy_inputs_metadata(inputs.metadata, self._training_indices, outputs.metadata, self.hyperparams) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _copy_inputs_metadata(cls, inputs_metadata: metadata_base.DataMetadata, input_indices: List[int], - outputs_metadata: metadata_base.DataMetadata, hyperparams): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - target_columns_metadata: List[OrderedDict] = [] - for column_index in input_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - 
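# Nothing is removed here; the return_semantic_type hyper-parameter is simply added to whatever semantic types the input column already carries. -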
add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - # If outputs has more columns than index, add Attribute Type to all remaining - if outputs_length > len(input_indices): - for column_index in range(len(input_indices), outputs_length): - column_metadata = OrderedDict() - semantic_types = set() - semantic_types.add(hyperparams["return_semantic_type"]) - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = list(semantic_types) - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKRandomTreesEmbedding.__doc__ = RandomTreesEmbedding.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKRidge.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKRidge.py deleted file mode 100644 index 3ca48ef..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKRidge.py +++ /dev/null @@ -1,444 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.linear_model.ridge import Ridge - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - coef_: Optional[ndarray] - intercept_: Optional[Union[float, ndarray]] - n_iter_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - alpha = hyperparams.Bounded[float]( - lower=0, - upper=None, - default=1, - description='Regularization strength; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. Alpha corresponds to ``C^-1`` in other linear models such as LogisticRegression or LinearSVC. If an array is passed, penalties are assumed to be specific to the targets. Hence they must correspond in number. copy_X : boolean, optional, default True If True, X will be copied; else, it may be overwritten.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - fit_intercept = hyperparams.UniformBool( - default=True, - description='Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (e.g. 
data is expected to be already centered).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - normalize = hyperparams.UniformBool( - default=False, - description='If True, the regressors X will be normalized before regression. This parameter is ignored when `fit_intercept` is set to False. When the regressors are normalized, note that this makes the hyperparameters learnt more robust and almost independent of the number of samples. The same property is not valid for standardized data. However, if you wish to standardize, please use `preprocessing.StandardScaler` before calling `fit` on an estimator with `normalize=False`.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_iter = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=1000, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Maximum number of iterations for conjugate gradient solver. For \'sparse_cg\' and \'lsqr\' solvers, the default value is determined by scipy.sparse.linalg. For \'sag\' solver, the default value is 1000.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Bounded[float]( - default=0.001, - lower=0, - upper=None, - description='Precision of the solution.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - solver = hyperparams.Enumeration[str]( - values=['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga'], - default='auto', - description='Solver to use in the computational routines: - \'auto\' chooses the solver automatically based on the type of data. - \'svd\' uses a Singular Value Decomposition of X to compute the Ridge coefficients. More stable for singular matrices than \'cholesky\'. - \'cholesky\' uses the standard scipy.linalg.solve function to obtain a closed-form solution. - \'sparse_cg\' uses the conjugate gradient solver as found in scipy.sparse.linalg.cg. As an iterative algorithm, this solver is more appropriate than \'cholesky\' for large-scale data (possibility to set `tol` and `max_iter`). - \'lsqr\' uses the dedicated regularized least-squares routine scipy.sparse.linalg.lsqr. It is the fastest but may not be available in old scipy versions. It also uses an iterative procedure. - \'sag\' uses a Stochastic Average Gradient descent. It also uses an iterative procedure, and is often faster than other solvers when both n_samples and n_features are large. Note that \'sag\' fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing. All last four solvers support both dense and sparse data. However, only \'sag\' supports sparse input when `fit_intercept` is True. .. versionadded:: 0.17 Stochastic Average Gradient descent solver.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. 
If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKRidge(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn Ridge - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.TIKHONOV_REGULARIZATION, ], - "name": "sklearn.linear_model.ridge.Ridge", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.ridge.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html']}, - "version": "2019.11.13", - "id": "2fb16403-8509-3f02-bdbf-9696e2fcad55", - "hyperparams_to_tune": ['alpha', 'max_iter'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = Ridge( - alpha=self.hyperparams['alpha'], - fit_intercept=self.hyperparams['fit_intercept'], - normalize=self.hyperparams['normalize'], - max_iter=self.hyperparams['max_iter'], - tol=self.hyperparams['tol'], - solver=self.hyperparams['solver'], - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, 
self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - coef_=None, - intercept_=None, - n_iter_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - coef_=getattr(self._clf, 'coef_', None), - intercept_=getattr(self._clf, 'intercept_', None), - n_iter_=getattr(self._clf, 'n_iter_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.coef_ = params['coef_'] - self._clf.intercept_ = params['intercept_'] - self._clf.n_iter_ = params['n_iter_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['coef_'] is not None: - self._fitted = True - if params['intercept_'] is not None: - self._fitted = True - if params['n_iter_'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - 
use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
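The rewrite that follows strips the training-only markers (`TrueTarget`, `SuggestedTarget`) and stamps on the configured output type, `PredictedTarget` by default. In isolation, with a toy semantic-type set:

```python
semantic_types = {
    "https://metadata.datadrivendiscovery.org/types/TrueTarget",
    "https://metadata.datadrivendiscovery.org/types/SuggestedTarget",
    "http://schema.org/Float",
}
to_remove = {
    "https://metadata.datadrivendiscovery.org/types/TrueTarget",
    "https://metadata.datadrivendiscovery.org/types/SuggestedTarget",
}
to_add = {"https://metadata.datadrivendiscovery.org/types/PredictedTarget"}

semantic_types = (semantic_types - to_remove) | to_add
print(sorted(semantic_types))
# ['http://schema.org/Float',
#  'https://metadata.datadrivendiscovery.org/types/PredictedTarget']
```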
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKRidge.__doc__ = Ridge.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKRobustScaler.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKRobustScaler.py deleted file mode 100644 index 6b98060..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKRobustScaler.py +++ /dev/null @@ -1,354 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.preprocessing.data import RobustScaler - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - center_: Optional[ndarray] - scale_: Optional[ndarray] - input_column_names: Optional[Any] - 
target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - with_centering = hyperparams.UniformBool( - default=True, - description='If True, center the data before scaling. This will cause ``transform`` to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - with_scaling = hyperparams.UniformBool( - default=True, - description='If True, scale the data to interquartile range.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - quantile_range = hyperparams.SortedSet( - elements=hyperparams.Uniform(0.0, 100.0, 50.0, lower_inclusive=False, upper_inclusive=False), - default=(25.0, 75.0), - min_size=2, - max_size=2, - description='Default: (25.0, 75.0) = (1st quantile, 3rd quantile) = IQR Quantile range used to calculate ``scale_``. .. versionadded:: 0.18', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKRobustScaler(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn RobustScaler - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.FEATURE_SCALING, ], - "name": "sklearn.preprocessing.data.RobustScaler", - "primitive_family": metadata_base.PrimitiveFamily.DATA_PREPROCESSING, - "python_path": "d3m.primitives.data_preprocessing.robust_scaler.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html']}, - "version": "2019.11.13", - "id": "854727ed-c82c-3137-ac59-fd52bc9ba385", - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = RobustScaler( - with_centering=self.hyperparams['with_centering'], - with_scaling=self.hyperparams['with_scaling'], - quantile_range=self.hyperparams['quantile_range'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = 
self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - outputs = self._wrap_predictions(inputs, sk_output) - if len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - center_=None, - scale_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - center_=getattr(self._clf, 'center_', None), - scale_=getattr(self._clf, 'scale_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.center_ = params['center_'] - self._clf.scale_ = params['scale_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['center_'] is not None: - self._fitted = True - if params['scale_'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - - @classmethod - def 
_get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._copy_inputs_metadata(inputs.metadata, self._training_indices, outputs.metadata, self.hyperparams) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _copy_inputs_metadata(cls, inputs_metadata: metadata_base.DataMetadata, input_indices: List[int], - outputs_metadata: metadata_base.DataMetadata, hyperparams): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - target_columns_metadata: List[OrderedDict] = [] - for column_index in input_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - # If outputs has more columns than index, add Attribute Type to all remaining - if outputs_length > len(input_indices): - for column_index in range(len(input_indices), outputs_length): - column_metadata = OrderedDict() - semantic_types = set() - semantic_types.add(hyperparams["return_semantic_type"]) - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = list(semantic_types) - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKRobustScaler.__doc__ = RobustScaler.__doc__ \ No newline at end of file
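Behaviorally, the wrapper above reduces to stock `sklearn.preprocessing.RobustScaler`: it centers on the median and scales by the quantile range (the IQR with the default `(25.0, 75.0)`), so a single outlier cannot distort the scale the way it would under mean/variance standardization. A toy demonstration with plain sklearn (data invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one gross outlier

scaler = RobustScaler(with_centering=True, with_scaling=True,
                      quantile_range=(25.0, 75.0))
print(scaler.fit_transform(X).ravel())  # [-1.  -0.5  0.   0.5 48.5]
print(scaler.center_, scaler.scale_)    # median 3.0, IQR 2.0
```

diff --git 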
a/common-primitives/sklearn-wrap/sklearn_wrap/SKSGDClassifier.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKSGDClassifier.py deleted file mode 100644 index e5f0422..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKSGDClassifier.py +++ /dev/null @@ -1,661 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.linear_model.stochastic_gradient import SGDClassifier - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - coef_: Optional[ndarray] - intercept_: Optional[ndarray] - n_iter_: Optional[int] - loss_function_: Optional[object] - classes_: Optional[ndarray] - _expanded_class_weight: Optional[ndarray] - t_: Optional[float] - C: Optional[float] - average_coef_: Optional[ndarray] - average_intercept_: Optional[ndarray] - standard_coef_: Optional[ndarray] - standard_intercept_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - loss = hyperparams.Enumeration[str]( - values=['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron', 'squared_loss', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'], - default='hinge', - description='The loss function to be used. Defaults to \'hinge\', which gives a linear SVM. The possible options are \'hinge\', \'log\', \'modified_huber\', \'squared_hinge\', \'perceptron\', or a regression loss: \'squared_loss\', \'huber\', \'epsilon_insensitive\', or \'squared_epsilon_insensitive\'. The \'log\' loss gives logistic regression, a probabilistic classifier. \'modified_huber\' is another smooth loss that brings tolerance to outliers as well as probability estimates. \'squared_hinge\' is like hinge but is quadratically penalized. \'perceptron\' is the linear loss used by the perceptron algorithm. The other losses are designed for regression but can be useful in classification as well; see SGDRegressor for a description.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - penalty = hyperparams.Enumeration[str]( - values=['l1', 'l2', 'elasticnet', 'none'], - default='l2', - description='The penalty (aka regularization term) to be used. Defaults to \'l2\' which is the standard regularizer for linear SVM models. 
\'l1\' and \'elasticnet\' might bring sparsity to the model (feature selection) not achievable with \'l2\'.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - alpha = hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.0001, - description='Constant that multiplies the regularization term. Defaults to 0.0001 Also used to compute learning_rate when set to \'optimal\'.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - l1_ratio = hyperparams.Bounded[float]( - lower=0, - upper=1, - default=0.15, - description='The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1. l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1. Defaults to 0.15.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - fit_intercept = hyperparams.UniformBool( - default=True, - description='Whether the intercept should be estimated or not. If False, the data is assumed to be already centered. Defaults to True.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - shuffle = hyperparams.UniformBool( - default=True, - description='Whether or not the training data should be shuffled after each epoch. Defaults to True.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - epsilon = hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.1, - description='Epsilon in the epsilon-insensitive loss functions; only if `loss` is \'huber\', \'epsilon_insensitive\', or \'squared_epsilon_insensitive\'. For \'huber\', determines the threshold at which it becomes less important to get the prediction exactly right. For epsilon-insensitive, any differences between the current prediction and the correct label are ignored if they are less than this threshold.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_jobs = hyperparams.Union( - configuration=OrderedDict({ - 'limit': hyperparams.Bounded[int]( - default=1, - lower=1, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'all_cores': hyperparams.Constant( - default=-1, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='limit', - description='The number of CPUs to use to do the OVA (One Versus All, for multi-class problems) computation. -1 means \'all CPUs\'. 
Defaults to 1.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - learning_rate = hyperparams.Enumeration[str]( - values=['optimal', 'invscaling', 'constant', 'adaptive'], - default='optimal', - description='The learning rate schedule: - \'constant\': eta = eta0 - \'optimal\': eta = 1.0 / (alpha * (t + t0)) [default] - \'invscaling\': eta = eta0 / pow(t, power_t) where t0 is chosen by a heuristic proposed by Leon Bottou.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - power_t = hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.5, - description='The exponent for inverse scaling learning rate [default 0.5].', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - warm_start = hyperparams.UniformBool( - default=False, - description='When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - average = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - default=2, - lower=2, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'bool': hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='bool', - description='When set to True, computes the averaged SGD weights and stores the result in the ``coef_`` attribute. If set to an int greater than 1, averaging will begin once the total number of samples seen reaches average. So ``average=10`` will begin averaging after seeing 10 samples.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - eta0 = hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.0, - description='The initial learning rate for the \'constant\' or \'invscaling\' schedules. The default value is 0.0 as eta0 is not used by the default schedule \'optimal\'.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_iter = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=1000, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='int', - description='The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the ``fit`` method, and not the `partial_fit`. Defaults to 5. Defaults to 1000 from 0.21, or if tol is not None. .. versionadded:: 0.19', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.001, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='float', - description='The stopping criterion. If it is not None, the iterations will stop when (loss > previous_loss - tol). Defaults to None. Defaults to 1e-3 from 0.21. .. 
versionadded:: 0.19', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - class_weight = hyperparams.Union( - configuration=OrderedDict({ - 'str': hyperparams.Constant( - default='balanced', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Preset for the class_weight fit parameter. Weights associated with classes. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))``', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - early_stopping = hyperparams.UniformBool( - default=False, - description='Whether to use early stopping to terminate training when validation score is not improving. If set to True, it will automatically set aside a fraction of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - validation_fraction = hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=1, - description='The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if early_stopping is True.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_iter_no_change = hyperparams.Bounded[int]( - default=5, - lower=0, - upper=None, - description='Number of iterations with no improvement to wait before early stopping.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. 
Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKSGDClassifier(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams], - ContinueFitMixin[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn SGDClassifier - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.GRADIENT_DESCENT, ], - "name": "sklearn.linear_model.stochastic_gradient.SGDClassifier", - "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION, - "python_path": "d3m.primitives.classification.sgd.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html']}, - "version": "2019.11.13", - "id": "2305e400-131e-356d-bf77-e8db19517b7a", - "hyperparams_to_tune": ['max_iter', 'penalty', 'alpha'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = SGDClassifier( - loss=self.hyperparams['loss'], - penalty=self.hyperparams['penalty'], - 
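The constructor wiring continuing below maps each resolved hyper-parameter straight onto the sklearn estimator; the `Union`-typed entries (`n_jobs`, `max_iter`, `tol`, `class_weight`, `average`) have already collapsed to a single concrete value by the time they are read. A stand-alone sketch of the same wiring, assuming a plain dict in place of the `Hyperparams` object and toy data:

```python
from sklearn.linear_model import SGDClassifier

# Plain-dict stand-in for the resolved hyper-parameter values.
resolved = {
    "loss": "hinge", "penalty": "l2", "alpha": 1e-4, "max_iter": 1000,
    "tol": 1e-3, "class_weight": None, "average": False,
}
clf = SGDClassifier(random_state=0, **resolved)

X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y = [0, 0, 1, 1]
print(clf.fit(X, y).predict([[2.5, 2.5]]))  # a point among the class-1 samples
```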
alpha=self.hyperparams['alpha'], - l1_ratio=self.hyperparams['l1_ratio'], - fit_intercept=self.hyperparams['fit_intercept'], - shuffle=self.hyperparams['shuffle'], - epsilon=self.hyperparams['epsilon'], - n_jobs=self.hyperparams['n_jobs'], - learning_rate=self.hyperparams['learning_rate'], - power_t=self.hyperparams['power_t'], - warm_start=self.hyperparams['warm_start'], - average=self.hyperparams['average'], - eta0=self.hyperparams['eta0'], - max_iter=self.hyperparams['max_iter'], - tol=self.hyperparams['tol'], - class_weight=self.hyperparams['class_weight'], - early_stopping=self.hyperparams['early_stopping'], - validation_fraction=self.hyperparams['validation_fraction'], - n_iter_no_change=self.hyperparams['n_iter_no_change'], - verbose=_verbose, - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - def continue_fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._training_inputs is None or self._training_outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.partial_fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - def 
produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - coef_=None, - intercept_=None, - n_iter_=None, - loss_function_=None, - classes_=None, - _expanded_class_weight=None, - t_=None, - C=None, - average_coef_=None, - average_intercept_=None, - standard_coef_=None, - standard_intercept_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - coef_=getattr(self._clf, 'coef_', None), - intercept_=getattr(self._clf, 'intercept_', None), - n_iter_=getattr(self._clf, 'n_iter_', None), - loss_function_=getattr(self._clf, 'loss_function_', None), - classes_=getattr(self._clf, 'classes_', None), - _expanded_class_weight=getattr(self._clf, '_expanded_class_weight', None), - t_=getattr(self._clf, 't_', None), - C=getattr(self._clf, 'C', None), - average_coef_=getattr(self._clf, 'average_coef_', None), - average_intercept_=getattr(self._clf, 'average_intercept_', None), - standard_coef_=getattr(self._clf, 'standard_coef_', None), - standard_intercept_=getattr(self._clf, 'standard_intercept_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.coef_ = params['coef_'] - self._clf.intercept_ = params['intercept_'] - self._clf.n_iter_ = params['n_iter_'] - self._clf.loss_function_ = params['loss_function_'] - self._clf.classes_ = params['classes_'] - self._clf._expanded_class_weight = params['_expanded_class_weight'] - self._clf.t_ = params['t_'] - self._clf.C = params['C'] - self._clf.average_coef_ = params['average_coef_'] - self._clf.average_intercept_ = params['average_intercept_'] - self._clf.standard_coef_ = params['standard_coef_'] - self._clf.standard_intercept_ = params['standard_intercept_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = 
params['target_columns_metadata_'] - - if params['coef_'] is not None: - self._fitted = True - if params['intercept_'] is not None: - self._fitted = True - if params['n_iter_'] is not None: - self._fitted = True - if params['loss_function_'] is not None: - self._fitted = True - if params['classes_'] is not None: - self._fitted = True - if params['_expanded_class_weight'] is not None: - self._fitted = True - if params['t_'] is not None: - self._fitted = True - if params['C'] is not None: - self._fitted = True - if params['average_coef_'] is not None: - self._fitted = True - if params['average_intercept_'] is not None: - self._fitted = True - if params['standard_coef_'] is not None: - self._fitted = True - if params['standard_intercept_'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - 
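# Keep the selected targets' column names alongside their indices so produce() can relabel predictions. - 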
target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKSGDClassifier.__doc__ = SGDClassifier.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKSGDRegressor.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKSGDRegressor.py deleted file mode 100644 index a6361ef..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKSGDRegressor.py +++ /dev/null @@ -1,643 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from 
sklearn.linear_model.stochastic_gradient import SGDRegressor - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - coef_: Optional[ndarray] - intercept_: Optional[ndarray] - average_coef_: Optional[ndarray] - average_intercept_: Optional[ndarray] - t_: Optional[float] - n_iter_: Optional[int] - C: Optional[float] - standard_coef_: Optional[ndarray] - standard_intercept_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - loss = hyperparams.Choice( - choices={ - 'squared_loss': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'huber': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'epsilon': hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'epsilon_insensitive': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'epsilon': hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'squared_epsilon_insensitive': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'epsilon': hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ) - }, - default='squared_loss', - description='The loss function to be used. Defaults to \'squared_loss\' which refers to the ordinary least squares fit. \'huber\' modifies \'squared_loss\' to focus less on getting outliers correct by switching from squared to linear loss past a distance of epsilon. \'epsilon_insensitive\' ignores errors less than epsilon and is linear past that; this is the loss function used in SVR. \'squared_epsilon_insensitive\' is the same but becomes squared loss past a tolerance of epsilon.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - penalty = hyperparams.Union( - configuration=OrderedDict({ - 'str': hyperparams.Enumeration[str]( - values=['l1', 'l2', 'elasticnet'], - default='l2', - description='The penalty (aka regularization term) to be used. Defaults to \'l2\' which is the standard regularizer for linear SVM models. 
\'l1\' and \'elasticnet\' might bring sparsity to the model (feature selection) not achievable with \'l2\'.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='str', - description='The penalty (aka regularization term) to be used. Defaults to \'l2\' which is the standard regularizer for linear SVM models. \'l1\' and \'elasticnet\' might bring sparsity to the model (feature selection) not achievable with \'l2\'.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - alpha = hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.0001, - description='Constant that multiplies the regularization term. Defaults to 0.0001. Also used to compute the learning rate when learning_rate is set to \'optimal\'.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - l1_ratio = hyperparams.Bounded[float]( - lower=0, - upper=1, - default=0.15, - description='The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1. l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1. Defaults to 0.15.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - fit_intercept = hyperparams.UniformBool( - default=True, - description='Whether the intercept should be estimated or not. If False, the data is assumed to be already centered. Defaults to True.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_iter = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=1000, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='int', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.001, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='float', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - shuffle = hyperparams.UniformBool( - default=True, - description='Whether or not the training data should be shuffled after each epoch. Defaults to True.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - learning_rate = hyperparams.Enumeration[str]( - values=['optimal', 'invscaling', 'constant', 'adaptive'], - default='invscaling', - description='The learning rate schedule: \'constant\': eta = eta0; \'optimal\': eta = 1.0 / (alpha * (t + t0)), where t0 is chosen by a heuristic proposed by Leon Bottou; \'invscaling\': eta = eta0 / pow(t, power_t); \'adaptive\': eta = eta0 for as long as the training loss keeps decreasing.',
- semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - eta0 = hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.01, - description='The initial learning rate [default 0.01].', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - power_t = hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.25, - description='The exponent for inverse scaling learning rate [default 0.25].', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - warm_start = hyperparams.UniformBool( - default=False, - description='When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - average = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - default=2, - lower=2, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'bool': hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='bool', - description='When set to True, computes the averaged SGD weights and stores the result in the ``coef_`` attribute. If set to an int greater than 1, averaging will begin once the total number of samples seen reaches average. So ``average=10`` will begin averaging after seeing 10 samples.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - early_stopping = hyperparams.UniformBool( - default=False, - description='Whether to use early stopping to terminate training when validation score is not improving. If set to True, it will automatically set aside a fraction of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - validation_fraction = hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=1, - description='The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if early_stopping is True.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - n_iter_no_change = hyperparams.Bounded[int]( - default=5, - lower=0, - upper=None, - description='Number of iterations with no improvement to wait before early stopping.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. 
If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_inputs_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_outputs_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in the input dataframe. Setting this to false makes the code ignore return_result and produce only the output dataframe." - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking, set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKSGDRegressor(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams], - ContinueFitMixin[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn SGDRegressor - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.GRADIENT_DESCENT, ], - "name": "sklearn.linear_model.stochastic_gradient.SGDRegressor", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.sgd.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html']}, - "version": "2019.11.13", - "id": "db3a7669-72e1-3c95-91c1-0c2a3f137d78", - "hyperparams_to_tune": ['max_iter', 'penalty', 'alpha'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # Instantiate the underlying sklearn estimator from the primitive's hyper-parameters. - self._clf = SGDRegressor( - loss=self.hyperparams['loss']['choice'], - epsilon=self.hyperparams['loss'].get('epsilon', 0.1), - penalty=self.hyperparams['penalty'], - alpha=self.hyperparams['alpha'], - l1_ratio=self.hyperparams['l1_ratio'], - fit_intercept=self.hyperparams['fit_intercept'], - max_iter=self.hyperparams['max_iter'], - tol=self.hyperparams['tol'], - shuffle=self.hyperparams['shuffle'], - learning_rate=self.hyperparams['learning_rate'], - eta0=self.hyperparams['eta0'], - power_t=self.hyperparams['power_t'], - warm_start=self.hyperparams['warm_start'], - average=self.hyperparams['average'], - early_stopping=self.hyperparams['early_stopping'], - validation_fraction=self.hyperparams['validation_fraction'], - n_iter_no_change=self.hyperparams['n_iter_no_change'], - verbose=_verbose, - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if 
self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - def continue_fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._training_inputs is None or self._training_outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.partial_fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - coef_=None, - intercept_=None, - average_coef_=None, - average_intercept_=None, - t_=None, - n_iter_=None, - C=None, - standard_coef_=None, - standard_intercept_=None, - 
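# Column bookkeeping is returned even when the estimator is unfitted, so a restored primitive can still rebuild its column selection. - 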
input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - coef_=getattr(self._clf, 'coef_', None), - intercept_=getattr(self._clf, 'intercept_', None), - average_coef_=getattr(self._clf, 'average_coef_', None), - average_intercept_=getattr(self._clf, 'average_intercept_', None), - t_=getattr(self._clf, 't_', None), - n_iter_=getattr(self._clf, 'n_iter_', None), - C=getattr(self._clf, 'C', None), - standard_coef_=getattr(self._clf, 'standard_coef_', None), - standard_intercept_=getattr(self._clf, 'standard_intercept_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.coef_ = params['coef_'] - self._clf.intercept_ = params['intercept_'] - self._clf.average_coef_ = params['average_coef_'] - self._clf.average_intercept_ = params['average_intercept_'] - self._clf.t_ = params['t_'] - self._clf.n_iter_ = params['n_iter_'] - self._clf.C = params['C'] - self._clf.standard_coef_ = params['standard_coef_'] - self._clf.standard_intercept_ = params['standard_intercept_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['coef_'] is not None: - self._fitted = True - if params['intercept_'] is not None: - self._fitted = True - if params['average_coef_'] is not None: - self._fitted = True - if params['average_intercept_'] is not None: - self._fitted = True - if params['t_'] is not None: - self._fitted = True - if params['n_iter_'] is not None: - self._fitted = True - if params['C'] is not None: - self._fitted = True - if params['standard_coef_'] is not None: - self._fitted = True - if params['standard_intercept_'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = 
set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
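- # TrueTarget and SuggestedTarget are stripped below; PredictedTarget and the configured return_semantic_type are added in their place.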
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKSGDRegressor.__doc__ = SGDRegressor.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKSVC.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKSVC.py deleted file mode 100644 index c8f60e5..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKSVC.py +++ /dev/null @@ -1,635 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.svm.classes import SVC - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class 
Params(params.Params): - support_: Optional[ndarray] - support_vectors_: Optional[ndarray] - n_support_: Optional[ndarray] - dual_coef_: Optional[ndarray] - intercept_: Optional[ndarray] - _sparse: Optional[bool] - shape_fit_: Optional[tuple] - _dual_coef_: Optional[ndarray] - _intercept_: Optional[ndarray] - probA_: Optional[ndarray] - probB_: Optional[ndarray] - _gamma: Optional[float] - classes_: Optional[ndarray] - class_weight_: Optional[ndarray] - fit_status_: Optional[int] - epsilon: Optional[float] - nu: Optional[float] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - C = hyperparams.Bounded[float]( - default=1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - description='Penalty parameter C of the error term.' - ) - kernel = hyperparams.Choice( - choices={ - 'linear': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'poly': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'degree': hyperparams.Bounded[int]( - default=3, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'gamma': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'auto': hyperparams.Constant( - default='auto', - description='1/n_features will be used.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'coef0': hyperparams.Constant( - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'rbf': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'gamma': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'auto': hyperparams.Constant( - default='auto', - description='1/n_features will be used.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'sigmoid': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'gamma': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'auto': hyperparams.Constant( - default='auto', - description='1/n_features will be used.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'coef0': hyperparams.Constant( - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ) - }, - default='rbf', - description='Specifies the kernel type to be used in the algorithm. 
It must be one of \'linear\', \'poly\', \'rbf\' or \'sigmoid\'; these are the four choices this primitive exposes. If none is given, \'rbf\' will be used.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - probability = hyperparams.UniformBool( - default=False, - description='Whether to enable probability estimates. This must be enabled prior to calling `fit`, and will slow down that method.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - shrinking = hyperparams.UniformBool( - default=True, - description='Whether to use the shrinking heuristic.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Bounded[float]( - default=0.001, - lower=0, - upper=None, - description='Tolerance for stopping criterion.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - cache_size = hyperparams.Bounded[float]( - default=200, - lower=0, - upper=None, - description='Specify the size of the kernel cache (in MB).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - class_weight = hyperparams.Union( - configuration=OrderedDict({ - 'str': hyperparams.Constant( - default='balanced', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))``.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_iter = hyperparams.Bounded[int]( - default=-1, - lower=-1, - upper=None, - description='Hard limit on iterations within solver, or -1 for no limit.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - decision_function_shape = hyperparams.Enumeration[str]( - values=['ovr', 'ovo'], - default='ovr', - description='Whether to return a one-vs-rest (\'ovr\') decision function of shape (n_samples, n_classes) as all other classifiers, or the original one-vs-one (\'ovo\') decision function of libsvm which has shape (n_samples, n_classes * (n_classes - 1) / 2). *decision_function_shape=\'ovr\'* is recommended and is the default here.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. 
If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_inputs_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_outputs_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in the input dataframe. Setting this to false makes the code ignore return_result and produce only the output dataframe." - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking, set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKSVC(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn SVC - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.SUPPORT_VECTOR_MACHINE, ], - "name": "sklearn.svm.classes.SVC", - "primitive_family": metadata_base.PrimitiveFamily.CLASSIFICATION, - "python_path": "d3m.primitives.classification.svc.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html']}, - "version": "2019.11.13", - "id": "0ae7d42d-f765-3348-a28c-57d94880aa6a", - "hyperparams_to_tune": ['C', 'kernel'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # Instantiate the underlying sklearn estimator; kernel-specific settings (degree, gamma, coef0) are read from the nested kernel hyper-parameter with fallbacks. - self._clf = SVC( - C=self.hyperparams['C'], - kernel=self.hyperparams['kernel']['choice'], - degree=self.hyperparams['kernel'].get('degree', 3), - gamma=self.hyperparams['kernel'].get('gamma', 'auto'), - coef0=self.hyperparams['kernel'].get('coef0', 0), - probability=self.hyperparams['probability'], - shrinking=self.hyperparams['shrinking'], - tol=self.hyperparams['tol'], - cache_size=self.hyperparams['cache_size'], - class_weight=self.hyperparams['class_weight'], - max_iter=self.hyperparams['max_iter'], - decision_function_shape=self.hyperparams['decision_function_shape'], - verbose=_verbose, - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, 
self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - support_=None, - support_vectors_=None, - n_support_=None, - dual_coef_=None, - intercept_=None, - _sparse=None, - shape_fit_=None, - _dual_coef_=None, - _intercept_=None, - probA_=None, - probB_=None, - _gamma=None, - classes_=None, - class_weight_=None, - fit_status_=None, - epsilon=None, - nu=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - support_=getattr(self._clf, 'support_', None), - support_vectors_=getattr(self._clf, 'support_vectors_', None), - n_support_=getattr(self._clf, 'n_support_', None), - dual_coef_=getattr(self._clf, 'dual_coef_', None), - intercept_=getattr(self._clf, 'intercept_', None), - _sparse=getattr(self._clf, '_sparse', None), - shape_fit_=getattr(self._clf, 'shape_fit_', None), - _dual_coef_=getattr(self._clf, '_dual_coef_', None), - _intercept_=getattr(self._clf, '_intercept_', None), - probA_=getattr(self._clf, 'probA_', None), - probB_=getattr(self._clf, 'probB_', None), - _gamma=getattr(self._clf, '_gamma', None), - classes_=getattr(self._clf, 'classes_', None), - class_weight_=getattr(self._clf, 'class_weight_', None), - fit_status_=getattr(self._clf, 'fit_status_', None), - epsilon=getattr(self._clf, 'epsilon', None), - nu=getattr(self._clf, 'nu', None), - input_column_names=self._input_column_names, - 
training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.support_ = params['support_'] - self._clf.support_vectors_ = params['support_vectors_'] - self._clf.n_support_ = params['n_support_'] - self._clf.dual_coef_ = params['dual_coef_'] - self._clf.intercept_ = params['intercept_'] - self._clf._sparse = params['_sparse'] - self._clf.shape_fit_ = params['shape_fit_'] - self._clf._dual_coef_ = params['_dual_coef_'] - self._clf._intercept_ = params['_intercept_'] - self._clf.probA_ = params['probA_'] - self._clf.probB_ = params['probB_'] - self._clf._gamma = params['_gamma'] - self._clf.classes_ = params['classes_'] - self._clf.class_weight_ = params['class_weight_'] - self._clf.fit_status_ = params['fit_status_'] - self._clf.epsilon = params['epsilon'] - self._clf.nu = params['nu'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['support_'] is not None: - self._fitted = True - if params['support_vectors_'] is not None: - self._fitted = True - if params['n_support_'] is not None: - self._fitted = True - if params['dual_coef_'] is not None: - self._fitted = True - if params['intercept_'] is not None: - self._fitted = True - if params['_sparse'] is not None: - self._fitted = True - if params['shape_fit_'] is not None: - self._fitted = True - if params['_dual_coef_'] is not None: - self._fitted = True - if params['_intercept_'] is not None: - self._fitted = True - if params['probA_'] is not None: - self._fitted = True - if params['probB_'] is not None: - self._fitted = True - if params['_gamma'] is not None: - self._fitted = True - if params['classes_'] is not None: - self._fitted = True - if params['class_weight_'] is not None: - self._fitted = True - if params['fit_status_'] is not None: - self._fitted = True - if params['epsilon'] is not None: - self._fitted = True - if params['nu'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - 
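# Beyond a numeric structural type, the column must also carry the Attribute semantic type, checked below. - 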
semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
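- # As in the other wrappers, true/suggested target types are swapped for PredictedTarget so downstream steps treat these columns as predictions.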
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKSVC.__doc__ = SVC.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKSVR.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKSVR.py deleted file mode 100644 index 8f17ca5..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKSVR.py +++ /dev/null @@ -1,616 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.svm.classes import SVR - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - 
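# Note on the SKSVR imports above: "from sklearn.svm.classes import SVR" is the
# private path that the sklearn versions targeted here still exposed; it was
# deprecated in sklearn 0.22 and removed in 0.24. The public spelling,
# "from sklearn.svm import SVR", presumably works across both old and new
# versions and would be the safer choice today.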
support_: Optional[ndarray] - support_vectors_: Optional[ndarray] - dual_coef_: Optional[ndarray] - intercept_: Optional[ndarray] - _sparse: Optional[bool] - shape_fit_: Optional[tuple] - n_support_: Optional[ndarray] - probA_: Optional[ndarray] - probB_: Optional[ndarray] - _gamma: Optional[float] - _dual_coef_: Optional[ndarray] - _intercept_: Optional[ndarray] - class_weight_: Optional[ndarray] - fit_status_: Optional[int] - class_weight: Optional[Union[str, Dict, List[Dict]]] - nu: Optional[float] - probability: Optional[bool] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - C = hyperparams.Bounded[float]( - default=1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - description='Penalty parameter C of the error term.' - ) - epsilon = hyperparams.Bounded[float]( - lower=0, - upper=None, - default=0.1, - description='Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - kernel = hyperparams.Choice( - choices={ - 'linear': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ), - 'poly': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'degree': hyperparams.Bounded[int]( - default=3, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'gamma': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'auto': hyperparams.Constant( - default='auto', - description='1/n_features will be used.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'coef0': hyperparams.Constant( - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'rbf': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'gamma': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'auto': hyperparams.Constant( - default='auto', - description='1/n_features will be used.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'sigmoid': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'gamma': hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'auto': hyperparams.Constant( - default='auto', - description='1/n_features will be used.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - 
default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ), - 'coef0': hyperparams.Constant( - default=0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'precomputed': hyperparams.Hyperparams.define( - configuration=OrderedDict({}) - ) - }, - default='rbf', - description='Specifies the kernel type to be used in the algorithm. It must be one of \'linear\', \'poly\', \'rbf\', \'sigmoid\', \'precomputed\' or a callable. If none is given, \'rbf\' will be used. If a callable is given it is used to precompute the kernel matrix.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - shrinking = hyperparams.UniformBool( - default=True, - description='Whether to use the shrinking heuristic.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - tol = hyperparams.Bounded[float]( - default=0.001, - lower=0, - upper=None, - description='Tolerance for stopping criterion.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - cache_size = hyperparams.Bounded[float]( - default=200, - lower=0, - upper=None, - description='Specify the size of the kernel cache (in MB).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter'] - ) - max_iter = hyperparams.Bounded[int]( - default=-1, - lower=-1, - upper=None, - description='Hard limit on iterations within solver, or -1 for no limit.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? 
This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKSVR(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn SVR - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.SUPPORT_VECTOR_MACHINE, ], - "name": "sklearn.svm.classes.SVR", - "primitive_family": metadata_base.PrimitiveFamily.REGRESSION, - "python_path": "d3m.primitives.regression.svr.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html']}, - "version": "2019.11.13", - "id": "ebbc3404-902d-33cc-a10c-e42b06dfe60c", - "hyperparams_to_tune": ['C', 'kernel'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = SVR( - C=self.hyperparams['C'], - epsilon=self.hyperparams['epsilon'], - kernel=self.hyperparams['kernel']['choice'], - degree=self.hyperparams['kernel'].get('degree', 3), - gamma=self.hyperparams['kernel'].get('gamma', 'auto'), - coef0=self.hyperparams['kernel'].get('coef0', 0), - shrinking=self.hyperparams['shrinking'], - tol=self.hyperparams['tol'], - cache_size=self.hyperparams['cache_size'], - max_iter=self.hyperparams['max_iter'], - verbose=_verbose - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - 
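# The SVR construction above flattens the nested "kernel" Choice
# hyper-parameter: the selected kernel name sits under the 'choice' key, and
# kernel-specific settings (degree, gamma, coef0) are read with .get() so that
# kernels lacking them fall back to sklearn defaults. A minimal sketch with a
# plain dict standing in for the hyper-parameter value (values hypothetical):

from sklearn.svm import SVR as _SVR

kernel_config = {'choice': 'poly', 'degree': 3, 'gamma': 'auto', 'coef0': 0}

_clf = _SVR(
    kernel=kernel_config['choice'],
    degree=kernel_config.get('degree', 3),   # only used by 'poly'
    gamma=kernel_config.get('gamma', 'auto'),
    coef0=kernel_config.get('coef0', 0),     # used by 'poly' and 'sigmoid'
)

# With kernel_config = {'choice': 'linear'} the .get() defaults still produce a
# valid SVR, which is why the lookups are tolerant rather than strict.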
self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._inputs is None or self._outputs is None: - raise ValueError("Missing training data.") - - if not self._new_training_data: - return CallResult(None) - self._new_training_data = False - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - self._target_columns_metadata = self._get_target_columns_metadata(self._training_outputs.metadata, self.hyperparams) - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.predict(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - # For primitives that allow predicting without fitting like GaussianProcessRegressor - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - output = self._wrap_predictions(inputs, sk_output) - output.columns = self._target_names - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._target_column_indices, - columns_list=output) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - support_=None, - support_vectors_=None, - dual_coef_=None, - intercept_=None, - _sparse=None, - shape_fit_=None, - n_support_=None, - probA_=None, - probB_=None, - _gamma=None, - _dual_coef_=None, - _intercept_=None, - class_weight_=None, - fit_status_=None, - class_weight=None, - nu=None, - probability=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - support_=getattr(self._clf, 'support_', None), - support_vectors_=getattr(self._clf, 
'support_vectors_', None), - dual_coef_=getattr(self._clf, 'dual_coef_', None), - intercept_=getattr(self._clf, 'intercept_', None), - _sparse=getattr(self._clf, '_sparse', None), - shape_fit_=getattr(self._clf, 'shape_fit_', None), - n_support_=getattr(self._clf, 'n_support_', None), - probA_=getattr(self._clf, 'probA_', None), - probB_=getattr(self._clf, 'probB_', None), - _gamma=getattr(self._clf, '_gamma', None), - _dual_coef_=getattr(self._clf, '_dual_coef_', None), - _intercept_=getattr(self._clf, '_intercept_', None), - class_weight_=getattr(self._clf, 'class_weight_', None), - fit_status_=getattr(self._clf, 'fit_status_', None), - class_weight=getattr(self._clf, 'class_weight', None), - nu=getattr(self._clf, 'nu', None), - probability=getattr(self._clf, 'probability', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.support_ = params['support_'] - self._clf.support_vectors_ = params['support_vectors_'] - self._clf.dual_coef_ = params['dual_coef_'] - self._clf.intercept_ = params['intercept_'] - self._clf._sparse = params['_sparse'] - self._clf.shape_fit_ = params['shape_fit_'] - self._clf.n_support_ = params['n_support_'] - self._clf.probA_ = params['probA_'] - self._clf.probB_ = params['probB_'] - self._clf._gamma = params['_gamma'] - self._clf._dual_coef_ = params['_dual_coef_'] - self._clf._intercept_ = params['_intercept_'] - self._clf.class_weight_ = params['class_weight_'] - self._clf.fit_status_ = params['fit_status_'] - self._clf.class_weight = params['class_weight'] - self._clf.nu = params['nu'] - self._clf.probability = params['probability'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['support_'] is not None: - self._fitted = True - if params['support_vectors_'] is not None: - self._fitted = True - if params['dual_coef_'] is not None: - self._fitted = True - if params['intercept_'] is not None: - self._fitted = True - if params['_sparse'] is not None: - self._fitted = True - if params['shape_fit_'] is not None: - self._fitted = True - if params['n_support_'] is not None: - self._fitted = True - if params['probA_'] is not None: - self._fitted = True - if params['probB_'] is not None: - self._fitted = True - if params['_gamma'] is not None: - self._fitted = True - if params['_dual_coef_'] is not None: - self._fitted = True - if params['_intercept_'] is not None: - self._fitted = True - if params['class_weight_'] is not None: - self._fitted = True - if params['fit_status_'] is not None: - self._fitted = True - if params['class_weight'] is not None: - self._fitted = True - if params['nu'] is not None: - self._fitted = True - if params['probability'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - 
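# get_params and set_params above form a save/restore round trip: get_params
# reads each learned attribute with getattr(self._clf, name, None), so an
# unfitted estimator serializes as all-None, and set_params assigns the values
# back. A compact, hypothetical sketch of that pattern over a stub object:

_ATTR_NAMES = ('support_', 'dual_coef_', 'intercept_', '_gamma')

def _snapshot(estimator) -> dict:
    # Attributes missing on an unfitted estimator are recorded as None.
    return {name: getattr(estimator, name, None) for name in _ATTR_NAMES}

def _restore(estimator, saved: dict) -> None:
    for name, value in saved.items():
        setattr(estimator, name, value)

class _Stub:  # stands in for the wrapped sklearn estimator
    pass

_est = _Stub()
_est.support_ = [1, 2, 3]
_fresh = _Stub()
_restore(_fresh, _snapshot(_est))
assert _fresh.support_ == [1, 2, 3] and _fresh.dual_coef_ is None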
columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
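# base_utils.get_columns_to_use above partitions column indices with a
# per-column predicate while honoring the explicit use/exclude hyper-parameters.
# A simplified, hypothetical re-implementation for illustration only (the real
# d3m helper also validates indices and reports skipped columns):

def _split_columns(n_columns, use_columns, exclude_columns, can_use_column):
    if use_columns:
        candidates = list(use_columns)  # an explicit selection wins outright
    else:
        candidates = [i for i in range(n_columns) if i not in set(exclude_columns)]
    used = [i for i in candidates if can_use_column(i)]
    not_used = [i for i in range(n_columns) if i not in used]
    return used, not_used

# Example: out of 4 columns, exclude column 0 and accept only even indices.
assert _split_columns(4, (), (0,), lambda i: i % 2 == 0) == ([2], [0, 1, 3])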
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set(["https://metadata.datadrivendiscovery.org/types/TrueTarget","https://metadata.datadrivendiscovery.org/types/SuggestedTarget",]) - add_semantic_types = set(["https://metadata.datadrivendiscovery.org/types/PredictedTarget",]) - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, self._target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/PredictedTarget') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKSVR.__doc__ = SVR.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKSelectFwe.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKSelectFwe.py deleted file mode 100644 index b7e534c..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKSelectFwe.py +++ /dev/null @@ -1,428 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.feature_selection.univariate_selection import SelectFwe -from sklearn.feature_selection import f_classif, f_regression, chi2 - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m 
import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - scores_: Optional[ndarray] - pvalues_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - score_func = hyperparams.Enumeration[str]( - default='f_classif', - values=['f_classif', 'f_regression', 'chi2'], - description='Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues). Default is f_classif (see below "See also"). The default function only works with classification tasks.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - alpha = hyperparams.Bounded[float]( - default=0.05, - lower=0, - upper=None, - description='The highest uncorrected p-value for features to keep.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['update_semantic_types', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", -) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. 
Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKSelectFwe(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn SelectFwe - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.FEATURE_SCALING, ], - "name": "sklearn.feature_selection.univariate_selection.SelectFwe", - "primitive_family": metadata_base.PrimitiveFamily.FEATURE_SELECTION, - "python_path": "d3m.primitives.feature_selection.select_fwe.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFwe.html']}, - "version": "2019.11.13", - "id": "09a4cffa-a59f-30ac-b78f-101c35b3f7c6", - "hyperparams_to_tune": ['alpha'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = SelectFwe( - score_func=eval(self.hyperparams['score_func']), - alpha=self.hyperparams['alpha'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None or self._training_outputs is None: - raise ValueError("Missing training data.") - - if len(self._training_indices) 
> 0 and len(self._target_column_indices) > 0: - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.transform(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - target_columns_metadata = self._copy_columns_metadata(inputs.iloc[:, self._training_indices].metadata, - self.produce_support().value) - output = self._wrap_predictions(inputs, sk_output, target_columns_metadata) - output.columns = [inputs.columns[idx] for idx in range(len(inputs.columns)) if idx in self.produce_support().value] - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - if self.hyperparams['return_result'] == 'update_semantic_types': - temp_inputs = inputs.copy() - columns_not_selected = sorted(set(range(len(temp_inputs.columns))) - set(self.produce_support().value)) - - for idx in columns_not_selected: - temp_inputs.metadata = temp_inputs.metadata.remove_semantic_type((metadata_base.ALL_ELEMENTS, idx), - 'https://metadata.datadrivendiscovery.org/types/Attribute') - - temp_inputs = temp_inputs.select_columns(self._training_indices) - outputs = base_utils.combine_columns(return_result='replace', - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=[temp_inputs]) - return CallResult(outputs) - - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output) - - return CallResult(outputs) - - def produce_support(self, *, timeout: float = None, iterations: int = None) -> CallResult[Any]: - all_indices = self._training_indices - selected_indices = self._clf.get_support(indices=True).tolist() - indices = [all_indices[index] for index in selected_indices] - return CallResult(indices) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - scores_=None, - pvalues_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - scores_=getattr(self._clf, 'scores_', None), - pvalues_=getattr(self._clf, 'pvalues_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.scores_ = 
params['scores_'] - self._clf.pvalues_ = params['pvalues_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['scores_'] is not None: - self._fitted = True - if params['pvalues_'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - 
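# produce_support above translates sklearn's selector output back into
# dataframe terms: get_support(indices=True) returns positions *within the
# fitted training columns*, which are then mapped through self._training_indices
# to column indices of the original input. A minimal sketch (values
# hypothetical):

all_indices = [1, 3, 4, 6]       # dataframe columns the selector was fitted on
selected_positions = [0, 2]      # what get_support(indices=True) might return
selected_indices = [all_indices[pos] for pos in selected_positions]
assert selected_indices == [1, 4]  # indices in the original dataframe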
outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - if len(target_columns_metadata) == 1: - name = column_metadata.get("name") - for idx in range(len(outputs.columns)): - outputs_metadata = outputs_metadata.update_column(idx, column_metadata) - if len(outputs.columns) > 1: - # Updating column names. - outputs_metadata = outputs_metadata.update((metadata_base.ALL_ELEMENTS, idx), {'name': "{}_{}".format(name, idx)}) - else: - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray, target_columns_metadata) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - - @classmethod - def _copy_columns_metadata(cls, inputs_metadata: metadata_base.DataMetadata, column_indices) -> List[OrderedDict]: - outputs_length = inputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in column_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = [] - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKSelectFwe.__doc__ = SelectFwe.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKSelectPercentile.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKSelectPercentile.py deleted file mode 100644 index 05044c1..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKSelectPercentile.py +++ /dev/null @@ -1,428 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.feature_selection.univariate_selection
import SelectPercentile -from sklearn.feature_selection import f_classif, f_regression, chi2 - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - scores_: Optional[ndarray] - pvalues_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - score_func = hyperparams.Enumeration[str]( - default='f_classif', - values=['f_classif', 'f_regression', 'chi2'], - description='Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues) or a single array with scores. Default is f_classif (see below "See also"). The default function only works with classification tasks.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - percentile = hyperparams.Bounded[int]( - default=10, - lower=0, - upper=100, - description='Percent of features to keep.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['update_semantic_types', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? 
This hyperparam is ignored if use_semantic_types is set to false.", -) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKSelectPercentile(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn SelectPercentile - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.STATISTICAL_MOMENT_ANALYSIS, ], - "name": "sklearn.feature_selection.univariate_selection.SelectPercentile", - "primitive_family": metadata_base.PrimitiveFamily.FEATURE_SELECTION, - "python_path": "d3m.primitives.feature_selection.select_percentile.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html']}, - "version": "2019.11.13", - "id": "16696c4d-bed9-34a2-b9ae-b882c069512d", - "hyperparams_to_tune": ['percentile'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = SelectPercentile( - score_func=eval(self.hyperparams['score_func']), - percentile=self.hyperparams['percentile'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, 
*, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None or self._training_outputs is None: - raise ValueError("Missing training data.") - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.transform(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - target_columns_metadata = self._copy_columns_metadata(inputs.iloc[:, self._training_indices].metadata, - self.produce_support().value) - output = self._wrap_predictions(inputs, sk_output, target_columns_metadata) - output.columns = [inputs.columns[idx] for idx in range(len(inputs.columns)) if idx in self.produce_support().value] - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - if self.hyperparams['return_result'] == 'update_semantic_types': - temp_inputs = inputs.copy() - columns_not_selected = sorted(set(range(len(temp_inputs.columns))) - set(self.produce_support().value)) - - for idx in columns_not_selected: - temp_inputs.metadata = temp_inputs.metadata.remove_semantic_type((metadata_base.ALL_ELEMENTS, idx), - 'https://metadata.datadrivendiscovery.org/types/Attribute') - - temp_inputs = temp_inputs.select_columns(self._training_indices) - outputs = base_utils.combine_columns(return_result='replace', - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=[temp_inputs]) - return CallResult(outputs) - - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output) - - return CallResult(outputs) - - def produce_support(self, *, timeout: float = None, iterations: int = None) -> CallResult[Any]: - all_indices = self._training_indices - selected_indices = self._clf.get_support(indices=True).tolist() - indices = [all_indices[index] for index in selected_indices] - return CallResult(indices) - - - def get_params(self) -> Params: - if not self._fitted: 
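# Note on the 'update_semantic_types' branch of produce above: rather than
# physically dropping unselected columns, it keeps the dataframe intact and
# strips the .../types/Attribute semantic type from every column that
# produce_support did not select, so downstream primitives that filter by
# semantic types skip those columns while the raw data is preserved.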
- return Params( - scores_=None, - pvalues_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - scores_=getattr(self._clf, 'scores_', None), - pvalues_=getattr(self._clf, 'pvalues_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.scores_ = params['scores_'] - self._clf.pvalues_ = params['pvalues_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['scores_'] is not None: - self._fitted = True - if params['pvalues_'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - 
semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - if len(target_columns_metadata) == 1: - name = column_metadata.get("name") - for idx in range(len(outputs.columns)): - outputs_metadata = outputs_metadata.update_column(idx, column_metadata) - if len(outputs.columns) > 1: - # Updating column names. 
- outputs_metadata = outputs_metadata.update((metadata_base.ALL_ELEMENTS, idx), {'name': "{}_{}".format(name, idx)}) - else: - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray, target_columns_metadata) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - - @classmethod - def _copy_columns_metadata(cls, inputs_metadata: metadata_base.DataMetadata, column_indices) -> List[OrderedDict]: - outputs_length = inputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in column_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = [] - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKSelectPercentile.__doc__ = SelectPercentile.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKSparseRandomProjection.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKSparseRandomProjection.py deleted file mode 100644 index 351f4d8..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKSparseRandomProjection.py +++ /dev/null @@ -1,375 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.random_projection import SparseRandomProjection - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - n_component_: Optional[int] - components_: Optional[Union[ndarray, sparse.spmatrix]] - density_: Optional[float] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - n_components = hyperparams.Union( - configuration=OrderedDict({ - 'int': hyperparams.Bounded[int]( - lower=0, - upper=None, - default=100, - description='Number of components to keep.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'auto': hyperparams.Constant( - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) 
- }), - default='auto', - description='Dimensionality of the target projection space. n_components can be automatically adjusted according to the number of samples in the dataset and the bound given by the Johnson-Lindenstrauss lemma. In that case the quality of the embedding is controlled by the ``eps`` parameter. It should be noted that Johnson-Lindenstrauss lemma can yield very conservative estimates of the required number of components as it makes no assumption on the structure of the dataset.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - density = hyperparams.Union( - configuration=OrderedDict({ - 'float': hyperparams.Uniform( - lower=0, - upper=1, - default=0.3, - description='Ratio of non-zero components in the random projection matrix.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'auto': hyperparams.Constant( - default='auto', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='auto', - description='Ratio of non-zero components in the random projection matrix. If density = \'auto\', the value is set to the minimum density as recommended by Ping Li et al.: 1 / sqrt(n_features). Use density = 1 / 3.0 if you want to reproduce the results from Achlioptas, 2001.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - eps = hyperparams.Bounded[float]( - default=0.1, - lower=0, - upper=1, - description='Parameter to control the quality of the embedding according to the Johnson-Lindenstrauss lemma when n_components is set to \'auto\'. Smaller values lead to better embedding and higher number of dimensions (n_components) in the target projection space.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - dense_output = hyperparams.UniformBool( - default=False, - description='If True, ensure that the output of the random projection is a dense numpy array even if the input and random projection matrix are both sparse. In practice, if the number of components is small the number of zero components in the projected data will be very small and it will be more CPU and memory efficient to use a dense representation. If False, the projected data uses a sparse representation if the input is sparse.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? 
This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking, set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKSparseRandomProjection(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn SparseRandomProjection - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.random_projection.SparseRandomProjection.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.RANDOM_PROJECTION, ], - "name": "sklearn.random_projection.SparseRandomProjection", - "primitive_family": metadata_base.PrimitiveFamily.DATA_TRANSFORMATION, - "python_path": "d3m.primitives.data_transformation.sparse_random_projection.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.random_projection.SparseRandomProjection.html']}, - "version": "2019.11.13", - "id": "43ddd6be-bb4f-3fd0-8765-df961c16d7dc", - "hyperparams_to_tune": ['n_components'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # Build the wrapped sklearn estimator from the primitive's hyper-parameters. - self._clf = SparseRandomProjection( - n_components=self.hyperparams['n_components'], - density=self.hyperparams['density'], - eps=self.hyperparams['eps'], - dense_output=self.hyperparams['dense_output'], - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def 
set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - outputs = self._wrap_predictions(inputs, sk_output) - if len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - n_component_=None, - components_=None, - density_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - n_component_=getattr(self._clf, 'n_component_', None), - components_=getattr(self._clf, 'components_', None), - density_=getattr(self._clf, 'density_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.n_component_ = params['n_component_'] - self._clf.components_ = params['components_'] - self._clf.density_ = params['density_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['n_component_'] is not None: - self._fitted = True - if params['components_'] is not None: - self._fitted = True - if params['density_'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - 
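# Note: with `use_semantic_types` enabled, column selection is delegated to
# `base_utils.get_columns_to_use`, which walks the dataframe's column metadata
# and keeps a column only when the nested `can_produce_column` callback below
# accepts it (a numeric structural type plus the 'Attribute' semantic type).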
def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._add_target_columns_metadata(outputs.metadata, self.hyperparams) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams): - - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_name = "output_{}".format(column_index) - column_metadata = OrderedDict() - semantic_types = set() - semantic_types.add(hyperparams["return_semantic_type"]) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKSparseRandomProjection.__doc__ = SparseRandomProjection.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKStandardScaler.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKStandardScaler.py deleted file mode 100644 index f8491bb..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKStandardScaler.py +++ /dev/null @@ -1,357 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.preprocessing.data import StandardScaler - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - scale_: Optional[ndarray] - mean_: Optional[ndarray] - var_: Optional[ndarray] - n_samples_seen_: Optional[Union[int, numpy.integer]] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: 
Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - with_mean = hyperparams.UniformBool( - default=True, - description='If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - with_std = hyperparams.UniformBool( - default=True, - description='If True, scale the data to unit variance (or equivalently, unit standard deviation).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking, set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKStandardScaler(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn StandardScaler - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.FEATURE_SCALING, ], - "name": "sklearn.preprocessing.data.StandardScaler", - "primitive_family": metadata_base.PrimitiveFamily.DATA_PREPROCESSING, - "python_path": "d3m.primitives.data_preprocessing.standard_scaler.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html']}, - "version": "2019.11.13", - "id": "d639947e-ece0-3a39-a666-e974acf4521d", - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # Build the wrapped sklearn estimator from the primitive's hyper-parameters. - self._clf = StandardScaler( - with_mean=self.hyperparams['with_mean'], - with_std=self.hyperparams['with_std'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - 
sk_output = sk_output.toarray() - outputs = self._wrap_predictions(inputs, sk_output) - if len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - scale_=None, - mean_=None, - var_=None, - n_samples_seen_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - scale_=getattr(self._clf, 'scale_', None), - mean_=getattr(self._clf, 'mean_', None), - var_=getattr(self._clf, 'var_', None), - n_samples_seen_=getattr(self._clf, 'n_samples_seen_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.scale_ = params['scale_'] - self._clf.mean_ = params['mean_'] - self._clf.var_ = params['var_'] - self._clf.n_samples_seen_ = params['n_samples_seen_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['scale_'] is not None: - self._fitted = True - if params['mean_'] is not None: - self._fitted = True - if params['var_'] is not None: - self._fitted = True - if params['n_samples_seen_'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - 
cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._copy_inputs_metadata(inputs.metadata, self._training_indices, outputs.metadata, self.hyperparams) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _copy_inputs_metadata(cls, inputs_metadata: metadata_base.DataMetadata, input_indices: List[int], - outputs_metadata: metadata_base.DataMetadata, hyperparams): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - target_columns_metadata: List[OrderedDict] = [] - for column_index in input_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - # If outputs has more columns than index, add Attribute Type to all remaining - if outputs_length > len(input_indices): - for column_index in range(len(input_indices), outputs_length): - column_metadata = OrderedDict() - semantic_types = set() - semantic_types.add(hyperparams["return_semantic_type"]) - column_name = "output_{}".format(column_index) - 
column_metadata["semantic_types"] = list(semantic_types) - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKStandardScaler.__doc__ = StandardScaler.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKStringImputer.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKStringImputer.py deleted file mode 100644 index 6e0c125..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKStringImputer.py +++ /dev/null @@ -1,371 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.impute import SimpleImputer -from sklearn.impute._base import _get_mask - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - statistics_: Optional[ndarray] - indicator_: Optional[sklearn.base.BaseEstimator] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - missing_values = hyperparams.Hyperparameter[str]( - default='', - description='The placeholder for the missing values. All occurrences of `missing_values` will be imputed.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - add_indicator = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - strategy = hyperparams.Enumeration[str]( - default='most_frequent', - values=['most_frequent', 'constant'], - description='The imputation strategy. - If "mean", then replace missing values using the mean along each column. Can only be used with numeric data. - If "median", then replace missing values using the median along each column. Can only be used with numeric data. - If "most_frequent", then replace missing using the most frequent value along each column. Can be used with strings or numeric data. - If "constant", then replace missing values with fill_value. Can be used with strings or numeric data. .. versionadded:: 0.20 strategy="constant" for fixed value imputation.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - fill_value = hyperparams.Hyperparameter[str]( - default='', - description='When strategy == "constant", fill_value is used to replace all occurrences of missing_values. 
If left to the default, fill_value will be 0 when imputing numerical data and "missing_value" for strings or object data types.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. 
To prevent pipelines from breaking, set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKStringImputer(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn SimpleImputer - `sklearn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html>`_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.IMPUTATION, ], - "name": "sklearn.impute.SimpleImputer", - "primitive_family": metadata_base.PrimitiveFamily.DATA_CLEANING, - "python_path": "d3m.primitives.data_cleaning.string_imputer.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html']}, - "version": "2019.11.13", - "id": "caeed986-cd1b-303b-900f-868dfc665341", - "hyperparams_to_tune": ['strategy'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None, - _verbose: int = 0) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # Build the wrapped sklearn estimator from the primitive's hyper-parameters. - self._clf = SimpleImputer( - missing_values=self.hyperparams['missing_values'], - add_indicator=self.hyperparams['add_indicator'], - strategy=self.hyperparams['strategy'], - fill_value=self.hyperparams['fill_value'], - verbose=_verbose - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices, _ = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use, _ = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.transform(sk_inputs) - except 
sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - target_columns_metadata = self._copy_columns_metadata(inputs.metadata, self._training_indices, self.hyperparams) - output = self._wrap_predictions(inputs, sk_output, target_columns_metadata) - - output.columns = [inputs.columns[idx] for idx in range(len(inputs.columns)) if idx in self._training_indices] - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - _, _, dropped_cols = self._get_columns_to_fit(inputs, self.hyperparams) - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices + dropped_cols, - columns_list=output) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - statistics_=None, - indicator_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - statistics_=getattr(self._clf, 'statistics_', None), - indicator_=getattr(self._clf, 'indicator_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.statistics_ = params['statistics_'] - self._clf.indicator_ = params['indicator_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['statistics_'] is not None: - self._fitted = True - if params['indicator_'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - - if not hyperparams['use_semantic_types']: - columns_to_produce = list(range(len(inputs.columns))) - - else: - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - - columns_to_drop = cls._get_columns_to_drop(inputs, columns_to_produce, hyperparams) - for col in columns_to_drop: - columns_to_produce.remove(col) - - return inputs.iloc[:, columns_to_produce], columns_to_produce, columns_to_drop - - @classmethod - def _get_columns_to_drop(cls, inputs: Inputs, column_indices: List[int], hyperparams: Hyperparams): - """ - Check for columns that consist entirely of missing_values and so cannot be imputed. - If strategy is constant and missing_values is nan, then all-nan columns will not be dropped. - :param inputs: - :param column_indices: - :return: - """ - columns_to_remove = [] - if hyperparams['strategy'] != 
"constant": - for _, col in enumerate(column_indices): - inp = inputs.iloc[:, [col]].values - mask = _get_mask(inp, hyperparams['missing_values']) - if mask.all(): - columns_to_remove.append(col) - return columns_to_remove - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (str,) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = [] - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray, target_columns_metadata) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - - @classmethod - def _copy_columns_metadata(cls, inputs_metadata: metadata_base.DataMetadata, column_indices, hyperparams) -> List[OrderedDict]: - outputs_length = inputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in column_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - 
semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKStringImputer.__doc__ = SimpleImputer.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKTfidfVectorizer.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKTfidfVectorizer.py deleted file mode 100644 index 99cd7da..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKTfidfVectorizer.py +++ /dev/null @@ -1,530 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.feature_extraction.text import TfidfVectorizer - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase -from d3m.metadata.base import ALL_ELEMENTS - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - vocabulary_: Optional[Sequence[dict]] - stop_words_: Optional[Sequence[set]] - _tfidf: Optional[Sequence[object]] - fixed_vocabulary_: Optional[Sequence[bool]] - _stop_words_id: Optional[Sequence[int]] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - - -class Hyperparams(hyperparams.Hyperparams): - strip_accents = hyperparams.Union( - configuration=OrderedDict({ - 'accents': hyperparams.Enumeration[str]( - default='ascii', - values=['ascii', 'unicode'], - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Remove accents during the preprocessing step. \'ascii\' is a fast method that only works on characters that have an direct ASCII mapping. \'unicode\' is a slightly slower method that works on any characters. None (default) does nothing.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - analyzer = hyperparams.Enumeration[str]( - default='word', - values=['word', 'char', 'char_wb'], - description='Whether the feature should be made of word or character n-grams. If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - ngram_range = hyperparams.SortedList( - elements=hyperparams.Bounded[int](1, None, 1), - default=(1, 1), - min_size=2, - max_size=2, - description='The lower and upper boundary of the range of n-values for different n-grams to be extracted. 
All values of n such that min_n <= n <= max_n will be used.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - stop_words = hyperparams.Union( - configuration=OrderedDict({ - 'string': hyperparams.Hyperparameter[str]( - default='english', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'list': hyperparams.List( - elements=hyperparams.Hyperparameter[str](''), - default=[], - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='If a string, it is passed to _check_stop_list and the appropriate stop list is returned. \'english\' is currently the only supported string value. If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if ``analyzer == \'word\'``. If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - lowercase = hyperparams.UniformBool( - default=True, - description='Convert all characters to lowercase before tokenizing.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - token_pattern = hyperparams.Hyperparameter[str]( - default='(?u)\\b\w\w+\\b', - description='Regular expression denoting what constitutes a "token", only used if ``analyzer == \'word\'``. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_df = hyperparams.Union( - configuration=OrderedDict({ - 'proportion': hyperparams.Bounded[float]( - default=1.0, - lower=0.0, - upper=1.0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'absolute': hyperparams.Bounded[int]( - default=1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='proportion', - description='When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - min_df = hyperparams.Union( - configuration=OrderedDict({ - 'proportion': hyperparams.Bounded[float]( - default=1.0, - lower=0.0, - upper=1.0, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'absolute': hyperparams.Bounded[int]( - default=1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='absolute', - description='When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. 
This parameter is ignored if vocabulary is not None.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - max_features = hyperparams.Union( - configuration=OrderedDict({ - 'absolute': hyperparams.Bounded[int]( - default=1, - lower=0, - upper=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - binary = hyperparams.UniformBool( - default=False, - description='If True, all non-zero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary. (Set idf and normalization to False to get 0/1 outputs.)', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - norm = hyperparams.Union( - configuration=OrderedDict({ - 'str': hyperparams.Enumeration[str]( - default='l2', - values=['l1', 'l2'], - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ), - 'none': hyperparams.Constant( - default=None, - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'], - ) - }), - default='none', - description='Norm used to normalize term vectors. None for no normalization.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - use_idf = hyperparams.UniformBool( - default=True, - description='Enable inverse-document-frequency reweighting.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - smooth_idf = hyperparams.UniformBool( - default=True, - description='Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - sublinear_tf = hyperparams.UniformBool( - default=False, - description='Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? 
This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - - -class SKTfidfVectorizer(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn TfidfVectorizer - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.MINIMUM_REDUNDANCY_FEATURE_SELECTION, ], - "name": "sklearn.feature_extraction.text.TfidfVectorizer", - "primitive_family": metadata_base.PrimitiveFamily.DATA_PREPROCESSING, - "python_path": "d3m.primitives.data_preprocessing.tfidf_vectorizer.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.TfidfVectorizer.html']}, - "version": "2019.11.13", - "id": "1f7ce2c7-1ec8-3483-9a65-eedd4b5811d6", - "hyperparams_to_tune": ['max_df', 'min_df'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # True - - self._clf = list() - - self._training_inputs = None - self._target_names = None - self._training_indices = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - - if self._training_inputs is None: - raise ValueError("Missing training data.") - - if len(self._training_indices) > 0: - for column_index in range(len(self._training_inputs.columns)): - clf = self._create_new_sklearn_estimator() - clf.fit(self._training_inputs.iloc[:, column_index]) - self._clf.append(clf) - - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - 
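Since `fit` above trains one independent `TfidfVectorizer` per selected text column, the wrapper is driven through the standard d3m calling convention. A minimal usage sketch, assuming the deleted module were still installed and importable as `sklearn_wrap.SKTfidfVectorizer`, and that plain string columns with the default `use_semantic_types=False` are acceptable input:

```python
# Hypothetical usage of the wrapper defined above; the import path assumes
# the deleted sklearn_wrap package is still installed.
from d3m import container

from sklearn_wrap.SKTfidfVectorizer import SKTfidfVectorizer

# Two text columns, so fit() trains two independent TfidfVectorizer estimators.
inputs = container.DataFrame({
    'title': ['red fox', 'lazy dog'],
    'body': ['the quick red fox', 'jumps over the lazy dog'],
}, generate_metadata=True)

hyperparams_class = SKTfidfVectorizer.metadata.get_hyperparams()
primitive = SKTfidfVectorizer(hyperparams=hyperparams_class.defaults())
primitive.set_training_data(inputs=inputs)
primitive.fit()

# produce() transforms each column with its own estimator and, with the
# default return_result='new', returns only the generated tf-idf columns.
outputs = primitive.produce(inputs=inputs).value
print(outputs.shape)
```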
def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = inputs - if self.hyperparams['use_semantic_types']: - sk_inputs, training_indices = self._get_columns_to_fit(inputs, self.hyperparams) - else: - training_indices = list(range(len(inputs.columns))) - - # Iterate over all estimators and call transform on each. - # The number of estimators must equal the number of columns in the input. - if len(self._clf) != len(sk_inputs.columns): - raise RuntimeError("Input data does not have the same number of columns as training data") - outputs = [] - if len(self._training_indices) > 0: - for column_index in range(len(sk_inputs.columns)): - clf = self._clf[column_index] - output = clf.transform(sk_inputs.iloc[:, column_index]) - column_name = sk_inputs.columns[column_index] - - if sparse.issparse(output): - output = output.toarray() - output = self._wrap_predictions(inputs, output) - - # Updating column names. - output.columns = map(lambda x: "{}_{}".format(column_name, x), clf.get_feature_names()) - for i, name in enumerate(clf.get_feature_names()): - output.metadata = output.metadata.update((ALL_ELEMENTS, i), {'name': name}) - - outputs.append(output) - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=outputs) - - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - vocabulary_=None, - stop_words_=None, - _tfidf=None, - fixed_vocabulary_=None, - _stop_words_id=None, - training_indices_=self._training_indices, - target_names_=self._target_names - ) - - return Params( - vocabulary_=list(map(lambda clf: getattr(clf, 'vocabulary_', None), self._clf)), - stop_words_=list(map(lambda clf: getattr(clf, 'stop_words_', None), self._clf)), - _tfidf=list(map(lambda clf: getattr(clf, '_tfidf', None), self._clf)), - fixed_vocabulary_=list(map(lambda clf: getattr(clf, 'fixed_vocabulary_', None), self._clf)), - _stop_words_id=list(map(lambda clf: getattr(clf, '_stop_words_id', None), self._clf)), - training_indices_=self._training_indices, - target_names_=self._target_names - ) - - def set_params(self, *, params: Params) -> None: - for param, val in params.items(): - if val is not None and param not in ['target_names_', 'training_indices_']: - self._clf = list(map(lambda x: self._create_new_sklearn_estimator(), val)) - break - for index in range(len(self._clf)): - for param, val in params.items(): - if val is not None: - setattr(self._clf[index], param, val[index]) - else: - setattr(self._clf[index], param, None) - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._fitted = False - - if params['vocabulary_'] is not None: - self._fitted = True - if params['stop_words_'] is not None: - self._fitted = True - if params['_tfidf'] is not None: - self._fitted = True - if params['fixed_vocabulary_'] is not None: - self._fitted = True - if params['_stop_words_id'] is not None: - self._fitted = True - - def _create_new_sklearn_estimator(self): - clf = TfidfVectorizer( - strip_accents=self.hyperparams['strip_accents'], -
analyzer=self.hyperparams['analyzer'], - ngram_range=self.hyperparams['ngram_range'], - stop_words=self.hyperparams['stop_words'], - lowercase=self.hyperparams['lowercase'], - token_pattern=self.hyperparams['token_pattern'], - max_df=self.hyperparams['max_df'], - min_df=self.hyperparams['min_df'], - max_features=self.hyperparams['max_features'], - binary=self.hyperparams['binary'], - norm=self.hyperparams['norm'], - use_idf=self.hyperparams['use_idf'], - smooth_idf=self.hyperparams['smooth_idf'], - sublinear_tf=self.hyperparams['sublinear_tf'], - ) - return clf - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (str,) - accepted_semantic_types = set(["http://schema.org/Text",]) - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), [] - target_names = [] - target_semantic_type = [] - target_column_indices = [] - metadata = data.metadata - target_column_indices.extend(metadata.get_columns_with_semantic_type('https://metadata.datadrivendiscovery.org/types/TrueTarget')) - - for column_index in target_column_indices: - if column_index is metadata_base.ALL_ELEMENTS: - continue - column_index = typing.cast(metadata_base.SimpleSelectorSegment, column_index) - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - target_names.append(column_metadata.get('name', str(column_index))) - target_semantic_type.append(column_metadata.get('semantic_types', [])) - - targets = data.iloc[:, target_column_indices] - return targets, target_names, target_semantic_type - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. 
- semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._add_target_columns_metadata(outputs.metadata) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata): - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict() - semantic_types = [] - semantic_types.append('https://metadata.datadrivendiscovery.org/types/Attribute') - column_name = outputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - if column_name is None: - column_name = "output_{}".format(column_index) - column_metadata["semantic_types"] = semantic_types - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKTfidfVectorizer.__doc__ = TfidfVectorizer.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKTruncatedSVD.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKTruncatedSVD.py deleted file mode 100644 index 2591180..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/SKTruncatedSVD.py +++ /dev/null @@ -1,369 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.decomposition.truncated_svd import TruncatedSVD - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer -from d3m.primitive_interfaces.unsupervised_learning import UnsupervisedLearnerPrimitiveBase - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - components_: Optional[ndarray] - explained_variance_ratio_: Optional[ndarray] - explained_variance_: Optional[ndarray] - singular_values_: Optional[ndarray] - input_column_names: Optional[Any]
- target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - n_components = hyperparams.Bounded[int]( - default=2, - lower=0, - upper=None, - description='Desired dimensionality of output data. Must be strictly less than the number of features. The default value is useful for visualisation. For LSA, a value of 100 is recommended.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - algorithm = hyperparams.Choice( - choices={ - 'randomized': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'n_iter': hyperparams.Bounded[int]( - default=5, - lower=0, - upper=None, - description='Number of iterations for randomized SVD solver. Not used in arpack', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ), - 'arpack': hyperparams.Hyperparams.define( - configuration=OrderedDict({ - 'tol': hyperparams.Bounded[float]( - default=0, - lower=0, - upper=None, - description='Tolerance for ARPACK. 0 means machine precision. Ignored by randomized SVD solver.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - }) - ) - }, - default='randomized', - description='SVD solver to use. Either "arpack" for the ARPACK wrapper in SciPy (scipy.sparse.linalg.svds), or "randomized" for the randomized algorithm due to Halko (2009).', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to operate on. If any specified column cannot be parsed, it is skipped.", - ) - exclude_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not operate on. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['append', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? This hyperparam is ignored if use_semantic_types is set to false.", - ) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. 
Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute'], - default='https://metadata.datadrivendiscovery.org/types/Attribute', - description='Decides what semantic type to attach to generated attributes', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKTruncatedSVD(UnsupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn TruncatedSVD - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.SINGULAR_VALUE_DECOMPOSITION, ], - "name": "sklearn.decomposition.truncated_svd.TruncatedSVD", - "primitive_family": metadata_base.PrimitiveFamily.DATA_PREPROCESSING, - "python_path": "d3m.primitives.data_preprocessing.truncated_svd.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html']}, - "version": "2019.11.13", - "id": "9231fde3-7322-3c41-b4cf-d00a93558c44", - "hyperparams_to_tune": ['n_components'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = TruncatedSVD( - n_components=self.hyperparams['n_components'], - algorithm=self.hyperparams['algorithm']['choice'], - n_iter=self.hyperparams['algorithm'].get('n_iter', 5), - tol=self.hyperparams['algorithm'].get('tol', 0), - random_state=self.random_seed, - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - - - def set_training_data(self, *, inputs: Inputs) -> None: - self._inputs = inputs - self._fitted = False - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None: - return CallResult(None) - - if len(self._training_indices) > 0: - self._clf.fit(self._training_inputs) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - if not self._fitted: - raise PrimitiveNotFittedError("Primitive not fitted.") - sk_inputs = 
inputs - if self.hyperparams['use_semantic_types']: - sk_inputs = inputs.iloc[:, self._training_indices] - output_columns = [] - if len(self._training_indices) > 0: - sk_output = self._clf.transform(sk_inputs) - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - outputs = self._wrap_predictions(inputs, sk_output) - if len(outputs.columns) == len(self._input_column_names): - outputs.columns = self._input_column_names - output_columns = [outputs] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output_columns) - return CallResult(outputs) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - components_=None, - explained_variance_ratio_=None, - explained_variance_=None, - singular_values_=None, - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - components_=getattr(self._clf, 'components_', None), - explained_variance_ratio_=getattr(self._clf, 'explained_variance_ratio_', None), - explained_variance_=getattr(self._clf, 'explained_variance_', None), - singular_values_=getattr(self._clf, 'singular_values_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.components_ = params['components_'] - self._clf.explained_variance_ratio_ = params['explained_variance_ratio_'] - self._clf.explained_variance_ = params['explained_variance_'] - self._clf.singular_values_ = params['singular_values_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['components_'] is not None: - self._fitted = True - if params['explained_variance_ratio_'] is not None: - self._fitted = True - if params['explained_variance_'] is not None: - self._fitted = True - if params['singular_values_'] is not None: - self._fitted = True - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_columns'], - exclude_columns=hyperparams['exclude_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - 
column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - column_metadata.pop("structural_type", None) - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=True) - target_columns_metadata = self._add_target_columns_metadata(outputs.metadata, self.hyperparams) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - @classmethod - def _add_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams): - - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_name = "output_{}".format(column_index) - column_metadata = OrderedDict() - semantic_types = set() - semantic_types.add(hyperparams["return_semantic_type"]) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKTruncatedSVD.__doc__ = TruncatedSVD.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/SKVarianceThreshold.py b/common-primitives/sklearn-wrap/sklearn_wrap/SKVarianceThreshold.py deleted file mode 100644 index d6f30ab..0000000 ---
a/common-primitives/sklearn-wrap/sklearn_wrap/SKVarianceThreshold.py +++ /dev/null @@ -1,414 +0,0 @@ -from typing import Any, Callable, List, Dict, Union, Optional, Sequence, Tuple -from numpy import ndarray -from collections import OrderedDict -from scipy import sparse -import os -import sklearn -import numpy -import typing - -# Custom import commands if any -from sklearn.feature_selection.variance_threshold import VarianceThreshold - - -from d3m.container.numpy import ndarray as d3m_ndarray -from d3m.container import DataFrame as d3m_dataframe -from d3m.metadata import hyperparams, params, base as metadata_base -from d3m import utils -from d3m.base import utils as base_utils -from d3m.exceptions import PrimitiveNotFittedError -from d3m.primitive_interfaces.base import CallResult, DockerContainer - -from d3m.primitive_interfaces.supervised_learning import SupervisedLearnerPrimitiveBase -from d3m.primitive_interfaces.base import ProbabilisticCompositionalityMixin, ContinueFitMixin -from d3m import exceptions -import pandas - - - -Inputs = d3m_dataframe -Outputs = d3m_dataframe - - -class Params(params.Params): - variances_: Optional[ndarray] - input_column_names: Optional[Any] - target_names_: Optional[Sequence[Any]] - training_indices_: Optional[Sequence[int]] - target_column_indices_: Optional[Sequence[int]] - target_columns_metadata_: Optional[List[OrderedDict]] - - - -class Hyperparams(hyperparams.Hyperparams): - threshold = hyperparams.Bounded[float]( - default=0.0, - lower=0, - upper=None, - description='Features with a training-set variance lower than this threshold will be removed. The default is to keep all features with non-zero variance, i.e. remove the features that have the same value in all samples.', - semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'] - ) - - use_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training input. If any specified column cannot be parsed, it is skipped.", - ) - use_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to force primitive to use as training target. If any specified column cannot be parsed, it is skipped.", - ) - exclude_inputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training inputs. Applicable only if \"use_columns\" is not provided.", - ) - exclude_outputs_columns = hyperparams.Set( - elements=hyperparams.Hyperparameter[int](-1), - default=(), - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="A set of column indices to not use as training target. Applicable only if \"use_columns\" is not provided.", - ) - return_result = hyperparams.Enumeration( - values=['update_semantic_types', 'replace', 'new'], - default='new', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Should parsed columns be appended, should they replace original columns, or should only parsed columns be returned? 
This hyperparam is ignored if use_semantic_types is set to false.", -) - use_semantic_types = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Controls whether semantic_types metadata will be used for filtering columns in input dataframe. Setting this to false makes the code ignore return_result and will produce only the output dataframe" - ) - add_index_columns = hyperparams.UniformBool( - default=False, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Also include primary index columns if input data has them. Applicable only if \"return_result\" is set to \"new\".", - ) - error_on_no_input = hyperparams.UniformBool( - default=True, - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'], - description="Throw an exception if no input column is selected/provided. Defaults to true to behave like sklearn. To prevent pipelines from breaking set this to False.", - ) - - return_semantic_type = hyperparams.Enumeration[str]( - values=['https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/ConstructedAttribute', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - default='https://metadata.datadrivendiscovery.org/types/PredictedTarget', - description='Decides what semantic type to attach to generated output', - semantic_types=['https://metadata.datadrivendiscovery.org/types/ControlParameter'] - ) - -class SKVarianceThreshold(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]): - """ - Primitive wrapping for sklearn VarianceThreshold - `sklearn documentation `_ - - """ - - __author__ = "JPL MARVIN" - metadata = metadata_base.PrimitiveMetadata({ - "algorithm_types": [metadata_base.PrimitiveAlgorithmType.FEATURE_SCALING, ], - "name": "sklearn.feature_selection.variance_threshold.VarianceThreshold", - "primitive_family": metadata_base.PrimitiveFamily.FEATURE_SELECTION, - "python_path": "d3m.primitives.feature_selection.variance_threshold.SKlearn", - "source": {'name': 'JPL', 'contact': 'mailto:shah@jpl.nasa.gov', 'uris': ['https://gitlab.com/datadrivendiscovery/sklearn-wrap/issues', 'https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html']}, - "version": "2019.11.13", - "id": "980c43c7-ab2a-3dc9-943b-db08a7c25cb6", - "hyperparams_to_tune": ['threshold'], - 'installation': [ - {'type': metadata_base.PrimitiveInstallationType.PIP, - 'package_uri': 'git+https://gitlab.com/datadrivendiscovery/sklearn-wrap.git@{git_commit}#egg=sklearn_wrap'.format( - git_commit=utils.current_git_commit(os.path.dirname(__file__)), - ), - }] - }) - - def __init__(self, *, - hyperparams: Hyperparams, - random_seed: int = 0, - docker_containers: Dict[str, DockerContainer] = None) -> None: - - super().__init__(hyperparams=hyperparams, random_seed=random_seed, docker_containers=docker_containers) - - # False - self._clf = VarianceThreshold( - threshold=self.hyperparams['threshold'], - ) - - self._inputs = None - self._outputs = None - self._training_inputs = None - self._training_outputs = None - self._target_names = None - self._training_indices = None - self._target_column_indices = None - self._target_columns_metadata: List[OrderedDict] = None - self._input_column_names = None - self._fitted = False - self._new_training_data = False - - def set_training_data(self, *, inputs: Inputs, outputs: Outputs) -> None: - self._inputs = 
inputs - self._outputs = outputs - self._fitted = False - self._new_training_data = True - - def fit(self, *, timeout: float = None, iterations: int = None) -> CallResult[None]: - if self._fitted: - return CallResult(None) - - self._training_inputs, self._training_indices = self._get_columns_to_fit(self._inputs, self.hyperparams) - self._training_outputs, self._target_names, self._target_column_indices = self._get_targets(self._outputs, self.hyperparams) - self._input_column_names = self._training_inputs.columns - - if self._training_inputs is None or self._training_outputs is None: - raise ValueError("Missing training data.") - - if len(self._training_indices) > 0 and len(self._target_column_indices) > 0: - sk_training_output = self._training_outputs.values - - shape = sk_training_output.shape - if len(shape) == 2 and shape[1] == 1: - sk_training_output = numpy.ravel(sk_training_output) - - self._clf.fit(self._training_inputs, sk_training_output) - self._fitted = True - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - return CallResult(None) - - def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> CallResult[Outputs]: - sk_inputs, columns_to_use = self._get_columns_to_fit(inputs, self.hyperparams) - output = [] - if len(sk_inputs.columns): - try: - sk_output = self._clf.transform(sk_inputs) - except sklearn.exceptions.NotFittedError as error: - raise PrimitiveNotFittedError("Primitive not fitted.") from error - if sparse.issparse(sk_output): - sk_output = sk_output.toarray() - target_columns_metadata = self._copy_columns_metadata(inputs.iloc[:, self._training_indices].metadata, - self.produce_support().value) - output = self._wrap_predictions(inputs, sk_output, target_columns_metadata) - output.columns = [inputs.columns[idx] for idx in range(len(inputs.columns)) if idx in self.produce_support().value] - output = [output] - else: - if self.hyperparams['error_on_no_input']: - raise RuntimeError("No input columns were selected") - self.logger.warn("No input columns were selected") - - if self.hyperparams['return_result'] == 'update_semantic_types': - temp_inputs = inputs.copy() - columns_not_selected = sorted(set(range(len(temp_inputs.columns))) - set(self.produce_support().value)) - - for idx in columns_not_selected: - temp_inputs.metadata = temp_inputs.metadata.remove_semantic_type((metadata_base.ALL_ELEMENTS, idx), - 'https://metadata.datadrivendiscovery.org/types/Attribute') - - temp_inputs = temp_inputs.select_columns(self._training_indices) - outputs = base_utils.combine_columns(return_result='replace', - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=[temp_inputs]) - return CallResult(outputs) - - outputs = base_utils.combine_columns(return_result=self.hyperparams['return_result'], - add_index_columns=self.hyperparams['add_index_columns'], - inputs=inputs, column_indices=self._training_indices, - columns_list=output) - - return CallResult(outputs) - - def produce_support(self, *, timeout: float = None, iterations: int = None) -> CallResult[Any]: - all_indices = self._training_indices - selected_indices = self._clf.get_support(indices=True).tolist() - indices = [all_indices[index] for index in selected_indices] - return CallResult(indices) - - - def get_params(self) -> Params: - if not self._fitted: - return Params( - variances_=None, - 
input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - return Params( - variances_=getattr(self._clf, 'variances_', None), - input_column_names=self._input_column_names, - training_indices_=self._training_indices, - target_names_=self._target_names, - target_column_indices_=self._target_column_indices, - target_columns_metadata_=self._target_columns_metadata - ) - - def set_params(self, *, params: Params) -> None: - self._clf.variances_ = params['variances_'] - self._input_column_names = params['input_column_names'] - self._training_indices = params['training_indices_'] - self._target_names = params['target_names_'] - self._target_column_indices = params['target_column_indices_'] - self._target_columns_metadata = params['target_columns_metadata_'] - - if params['variances_'] is not None: - self._fitted = True - - - - - - - - @classmethod - def _get_columns_to_fit(cls, inputs: Inputs, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return inputs, list(range(len(inputs.columns))) - - inputs_metadata = inputs.metadata - - def can_produce_column(column_index: int) -> bool: - return cls._can_produce_column(inputs_metadata, column_index, hyperparams) - - columns_to_produce, columns_not_to_produce = base_utils.get_columns_to_use(inputs_metadata, - use_columns=hyperparams['use_inputs_columns'], - exclude_columns=hyperparams['exclude_inputs_columns'], - can_use_column=can_produce_column) - return inputs.iloc[:, columns_to_produce], columns_to_produce - # return columns_to_produce - - @classmethod - def _can_produce_column(cls, inputs_metadata: metadata_base.DataMetadata, column_index: int, hyperparams: Hyperparams) -> bool: - column_metadata = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - - accepted_structural_types = (int, float, numpy.integer, numpy.float64) - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/Attribute") - if not issubclass(column_metadata['structural_type'], accepted_structural_types): - return False - - semantic_types = set(column_metadata.get('semantic_types', [])) - - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - - return False - - @classmethod - def _get_targets(cls, data: d3m_dataframe, hyperparams: Hyperparams): - if not hyperparams['use_semantic_types']: - return data, list(data.columns), list(range(len(data.columns))) - - metadata = data.metadata - - def can_produce_column(column_index: int) -> bool: - accepted_semantic_types = set() - accepted_semantic_types.add("https://metadata.datadrivendiscovery.org/types/TrueTarget") - column_metadata = metadata.query((metadata_base.ALL_ELEMENTS, column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - if len(semantic_types) == 0: - cls.logger.warning("No semantic types found in column metadata") - return False - # Making sure all accepted_semantic_types are available in semantic_types - if len(accepted_semantic_types - semantic_types) == 0: - return True - return False - - target_column_indices, target_columns_not_to_produce = base_utils.get_columns_to_use(metadata, - use_columns=hyperparams[ - 
'use_outputs_columns'], - exclude_columns= - hyperparams[ - 'exclude_outputs_columns'], - can_use_column=can_produce_column) - targets = [] - if target_column_indices: - targets = data.select_columns(target_column_indices) - target_column_names = [] - for idx in target_column_indices: - target_column_names.append(data.columns[idx]) - return targets, target_column_names, target_column_indices - - @classmethod - def _get_target_columns_metadata(cls, outputs_metadata: metadata_base.DataMetadata, hyperparams) -> List[OrderedDict]: - outputs_length = outputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in range(outputs_length): - column_metadata = OrderedDict(outputs_metadata.query_column(column_index)) - - # Update semantic types and prepare it for predicted targets. - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = set() - add_semantic_types.add(hyperparams["return_semantic_type"]) - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - @classmethod - def _update_predictions_metadata(cls, inputs_metadata: metadata_base.DataMetadata, outputs: Optional[Outputs], - target_columns_metadata: List[OrderedDict]) -> metadata_base.DataMetadata: - outputs_metadata = metadata_base.DataMetadata().generate(value=outputs) - - for column_index, column_metadata in enumerate(target_columns_metadata): - if len(target_columns_metadata) == 1: - name = column_metadata.get("name") - for idx in range(len(outputs.columns)): - outputs_metadata = outputs_metadata.update_column(idx, column_metadata) - if len(outputs.columns) > 1: - # Updating column names.
- outputs_metadata = outputs_metadata.update((metadata_base.ALL_ELEMENTS, idx), {'name': "{}_{}".format(name, idx)}) - else: - outputs_metadata = outputs_metadata.update_column(column_index, column_metadata) - - return outputs_metadata - - - def _wrap_predictions(self, inputs: Inputs, predictions: ndarray, target_columns_metadata) -> Outputs: - outputs = d3m_dataframe(predictions, generate_metadata=False) - outputs.metadata = self._update_predictions_metadata(inputs.metadata, outputs, target_columns_metadata) - return outputs - - - - @classmethod - def _copy_columns_metadata(cls, inputs_metadata: metadata_base.DataMetadata, column_indices) -> List[OrderedDict]: - outputs_length = inputs_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - - target_columns_metadata: List[OrderedDict] = [] - for column_index in column_indices: - column_name = inputs_metadata.query((metadata_base.ALL_ELEMENTS, column_index)).get("name") - column_metadata = OrderedDict(inputs_metadata.query_column(column_index)) - semantic_types = set(column_metadata.get('semantic_types', [])) - semantic_types_to_remove = set([]) - add_semantic_types = [] - semantic_types = semantic_types - semantic_types_to_remove - semantic_types = semantic_types.union(add_semantic_types) - column_metadata['semantic_types'] = list(semantic_types) - - column_metadata["name"] = str(column_name) - target_columns_metadata.append(column_metadata) - - return target_columns_metadata - - -SKVarianceThreshold.__doc__ = VarianceThreshold.__doc__ \ No newline at end of file diff --git a/common-primitives/sklearn-wrap/sklearn_wrap/__init__.py b/common-primitives/sklearn-wrap/sklearn_wrap/__init__.py deleted file mode 100644 index def4f5b..0000000 --- a/common-primitives/sklearn-wrap/sklearn_wrap/__init__.py +++ /dev/null @@ -1,2 +0,0 @@ -__author__ = 'JPL DARPA D3M TEAM' -__version__ = '2019.11.13' diff --git a/common-primitives/tests/test_audio_reader.py b/common-primitives/tests/test_audio_reader.py deleted file mode 100644 index f02bd2b..0000000 --- a/common-primitives/tests/test_audio_reader.py +++ /dev/null @@ -1,105 +0,0 @@ -import unittest -import os - -from d3m import container - -from common_primitives import audio_reader, dataset_to_dataframe, denormalize - - -class AudioReaderPrimitiveTestCase(unittest.TestCase): - def test_basic(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'audio_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - dataframe_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataframe_hyperparams_class.defaults().replace({'dataframe_resource': '0'})) - dataframe = dataframe_primitive.produce(inputs=dataset).value - - audio_hyperparams_class = audio_reader.AudioReaderPrimitive.metadata.get_hyperparams() - audio_primitive = audio_reader.AudioReaderPrimitive(hyperparams=audio_hyperparams_class.defaults().replace({'return_result': 'replace'})) - audios = audio_primitive.produce(inputs=dataframe).value - - self.assertEqual(audios.shape, (1, 1)) - self.assertEqual(audios.iloc[0, 0].shape, (4410, 1)) - - self._test_metadata(audios.metadata, True) - - self.assertEqual(audios.metadata.query((0, 0))['dimension']['length'], 4410) - self.assertEqual(audios.metadata.query((0, 0))['dimension']['sampling_rate'], 44100) - - def _test_metadata(self, 
metadata, is_table): - semantic_types = ('https://metadata.datadrivendiscovery.org/types/PrimaryKey', 'http://schema.org/AudioObject') - - if is_table: - semantic_types += ('https://metadata.datadrivendiscovery.org/types/Table',) - - self.assertEqual(metadata.query_column(0)['name'], 'filename') - self.assertEqual(metadata.query_column(0)['structural_type'], container.ndarray) - self.assertEqual(metadata.query_column(0)['semantic_types'], semantic_types) - - def test_boundaries_reassign(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'audio_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - denormalize_hyperparams_class = denormalize.DenormalizePrimitive.metadata.get_hyperparams() - denormalize_primitive = denormalize.DenormalizePrimitive(hyperparams=denormalize_hyperparams_class.defaults()) - dataset = denormalize_primitive.produce(inputs=dataset).value - - dataframe_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataframe_hyperparams_class.defaults()) - dataframe = dataframe_primitive.produce(inputs=dataset).value - - audio_hyperparams_class = audio_reader.AudioReaderPrimitive.metadata.get_hyperparams() - audio_primitive = audio_reader.AudioReaderPrimitive(hyperparams=audio_hyperparams_class.defaults().replace({'return_result': 'append'})) - audios = audio_primitive.produce(inputs=dataframe).value - - self.assertEqual(audios.shape, (1, 6)) - self.assertEqual(audios.iloc[0, 5].shape, (4410, 1)) - - self._test_boundaries_reassign_metadata(audios.metadata, True) - - self.assertEqual(audios.metadata.query((0, 5))['dimension']['length'], 4410) - self.assertEqual(audios.metadata.query((0, 5))['dimension']['sampling_rate'], 44100) - - def _test_boundaries_reassign_metadata(self, metadata, is_table): - semantic_types = ('http://schema.org/AudioObject', 'https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/UniqueKey') - - if is_table: - semantic_types += ('https://metadata.datadrivendiscovery.org/types/Table',) - - self.assertEqual(metadata.query_column(5)['name'], 'filename') - self.assertEqual(metadata.query_column(5)['structural_type'], container.ndarray) - self.assertEqual(metadata.query_column(5)['semantic_types'], semantic_types) - - self.assertEqual(metadata.query_column(2), { - 'structural_type': str, - 'name': 'start', - 'semantic_types': ( - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Boundary', - 'https://metadata.datadrivendiscovery.org/types/IntervalStart', - ), - 'boundary_for': { - 'resource_id': 'learningData', - 'column_index': 5, - }, - }) - self.assertEqual(metadata.query_column(3), { - 'structural_type': str, - 'name': 'end', - 'semantic_types': ( - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Boundary', - 'https://metadata.datadrivendiscovery.org/types/IntervalEnd', - ), - 'boundary_for': { - 'resource_id': 'learningData', - 'column_index': 5, - }, - }) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_cast_to_type.py b/common-primitives/tests/test_cast_to_type.py deleted file mode 100644 index 304ef18..0000000 --- a/common-primitives/tests/test_cast_to_type.py +++ /dev/null @@ -1,131 +0,0 @@ -import os -import logging -import unittest - 
-import numpy - -from d3m import container -from d3m.metadata import base as metadata_base - -from common_primitives import cast_to_type, column_parser, dataset_to_dataframe, extract_columns_semantic_types - - -class CastToTypePrimitiveTestCase(unittest.TestCase): - def test_basic(self): - inputs = container.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}, generate_metadata=True) - - self.assertEqual(inputs.dtypes['a'], numpy.int64) - self.assertEqual(inputs.dtypes['b'], object) - - hyperparams_class = cast_to_type.CastToTypePrimitive.metadata.get_hyperparams() - - primitive = cast_to_type.CastToTypePrimitive(hyperparams=hyperparams_class.defaults().replace({'type_to_cast': 'str'})) - - call_metadata = primitive.produce(inputs=inputs) - - self.assertIsInstance(call_metadata.value, container.DataFrame) - - self.assertEqual(len(call_metadata.value.dtypes), 2) - self.assertEqual(call_metadata.value.dtypes['a'], object) - self.assertEqual(call_metadata.value.dtypes['b'], object) - - self.assertEqual(call_metadata.value.metadata.query((metadata_base.ALL_ELEMENTS, 0))['structural_type'], str) - self.assertEqual(call_metadata.value.metadata.query((metadata_base.ALL_ELEMENTS, 1))['structural_type'], str) - self.assertEqual(call_metadata.value.metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'], 2) - - primitive = cast_to_type.CastToTypePrimitive(hyperparams=hyperparams_class.defaults().replace({'type_to_cast': 'float'})) - - with self.assertLogs(level=logging.WARNING) as cm: - call_metadata = primitive.produce(inputs=inputs) - - self.assertEqual(len(call_metadata.value.dtypes), 1) - self.assertEqual(call_metadata.value.dtypes['a'], float) - - self.assertEqual(call_metadata.value.metadata.query((metadata_base.ALL_ELEMENTS, 0))['structural_type'], float) - self.assertEqual(call_metadata.value.metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'], 1) - - self.assertEqual(len(cm.records), 1) - self.assertEqual(cm.records[0].msg, "Not all columns can be cast to type '%(type)s'. 
Skipping columns: %(columns)s") - - primitive = cast_to_type.CastToTypePrimitive(hyperparams=hyperparams_class.defaults().replace({'exclude_columns': (0,), 'type_to_cast': 'float'})) - - with self.assertRaisesRegex(ValueError, 'No columns to be cast to type'): - primitive.produce(inputs=inputs) - - def test_objects(self): - hyperparams_class = cast_to_type.CastToTypePrimitive.metadata.get_hyperparams() - - inputs = container.DataFrame({'a': [1, 2, 3], 'b': [{'a': 1}, {'b': 1}, {'c': 1}]}, { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': container.DataFrame, - 'dimension': { - 'length': 3, - }, - }, generate_metadata=False) - inputs.metadata = inputs.metadata.update((metadata_base.ALL_ELEMENTS,), { - 'dimension': { - 'length': 2, - }, - }) - inputs.metadata = inputs.metadata.update((metadata_base.ALL_ELEMENTS, 0), { - 'structural_type': int, - }) - inputs.metadata = inputs.metadata.update((metadata_base.ALL_ELEMENTS, 1), { - 'structural_type': dict, - }) - - self.assertEqual(inputs.dtypes['a'], numpy.int64) - self.assertEqual(inputs.dtypes['b'], object) - - primitive = cast_to_type.CastToTypePrimitive(hyperparams=hyperparams_class.defaults().replace({'type_to_cast': 'str'})) - - call_metadata = primitive.produce(inputs=inputs) - - self.assertEqual(len(call_metadata.value.dtypes), 2) - self.assertEqual(call_metadata.value.dtypes['a'], object) - self.assertEqual(call_metadata.value.dtypes['b'], object) - - self.assertEqual(call_metadata.value.metadata.query((metadata_base.ALL_ELEMENTS, 0))['structural_type'], str) - self.assertEqual(call_metadata.value.metadata.query((metadata_base.ALL_ELEMENTS, 1))['structural_type'], str) - self.assertEqual(call_metadata.value.metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'], 2) - - primitive = cast_to_type.CastToTypePrimitive(hyperparams=hyperparams_class.defaults().replace({'type_to_cast': 'float'})) - - with self.assertLogs(level=logging.WARNING) as cm: - call_metadata = primitive.produce(inputs=inputs) - - self.assertEqual(len(call_metadata.value.dtypes), 1) - self.assertEqual(call_metadata.value.dtypes['a'], float) - - self.assertEqual(call_metadata.value.metadata.query((metadata_base.ALL_ELEMENTS, 0))['structural_type'], float) - self.assertEqual(call_metadata.value.metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'], 1) - - self.assertEqual(len(cm.records), 1) - self.assertEqual(cm.records[0].msg, "Not all columns can be cast to type '%(type)s'. 
Skipping columns: %(columns)s") - - def test_data(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - dataframe = primitive.produce(inputs=dataset).value - - hyperparams_class = column_parser.ColumnParserPrimitive.metadata.get_hyperparams() - primitive = column_parser.ColumnParserPrimitive(hyperparams=hyperparams_class.defaults()) - dataframe = primitive.produce(inputs=dataframe).value - - hyperparams_class = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive.metadata.get_hyperparams() - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive(hyperparams=hyperparams_class.defaults()) - attributes = primitive.produce(inputs=dataframe).value - - hyperparams_class = cast_to_type.CastToTypePrimitive.metadata.get_hyperparams() - primitive = cast_to_type.CastToTypePrimitive(hyperparams=hyperparams_class.defaults().replace({'type_to_cast': 'float'})) - cast_attributes = primitive.produce(inputs=attributes).value - - self.assertEqual(cast_attributes.values.dtype, numpy.float64) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_column_map.py b/common-primitives/tests/test_column_map.py deleted file mode 100644 index 0323239..0000000 --- a/common-primitives/tests/test_column_map.py +++ /dev/null @@ -1,75 +0,0 @@ -import unittest -import os -import pickle -import sys - -from d3m import container, index, utils as d3m_utils - -TEST_PRIMITIVES_DIR = os.path.join(os.path.dirname(__file__), 'data', 'primitives') -sys.path.insert(0, TEST_PRIMITIVES_DIR) - -from test_primitives.null import NullTransformerPrimitive, NullUnsupervisedLearnerPrimitive - -# To hide any logging or stdout output. 
-with d3m_utils.silence(): - index.register_primitive('d3m.primitives.operator.null.TransformerTest', NullTransformerPrimitive) - index.register_primitive('d3m.primitives.operator.null.UnsupervisedLearnerTest', NullUnsupervisedLearnerPrimitive) - -from common_primitives import dataset_to_dataframe, csv_reader, denormalize, column_map, column_parser - -import utils as test_utils - - -class ColumnMapTestCase(unittest.TestCase): - def test_transformer(self): - self.maxDiff = None - - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'timeseries_dataset_2', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - hyperparams = denormalize.DenormalizePrimitive.metadata.get_hyperparams() - primitive = denormalize.DenormalizePrimitive(hyperparams=hyperparams.defaults()) - dataset = primitive.produce(inputs=dataset).value - - hyperparams = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams.defaults()) - dataframe = primitive.produce(inputs=dataset).value - - hyperparams = csv_reader.CSVReaderPrimitive.metadata.get_hyperparams() - primitive = csv_reader.CSVReaderPrimitive(hyperparams=hyperparams.defaults().replace({'return_result': 'replace'})) - dataframe = primitive.produce(inputs=dataframe).value - - hyperparams = column_map.DataFrameColumnMapPrimitive.metadata.get_hyperparams() - primitive = column_map.DataFrameColumnMapPrimitive( - # We have to make an instance of the primitive ourselves. - hyperparams=hyperparams.defaults().replace({ - # First we use an identity primitive, which should not really change anything. - 'primitive': NullTransformerPrimitive( - hyperparams=NullTransformerPrimitive.metadata.get_hyperparams().defaults(), - ), - }), - ) - mapped_dataframe = primitive.produce(inputs=dataframe).value - - self.assertEqual(test_utils.convert_through_json(test_utils.effective_metadata(dataframe.metadata)), test_utils.convert_through_json(test_utils.effective_metadata(mapped_dataframe.metadata))) - - self.assertEqual(test_utils.convert_through_json(dataframe), test_utils.convert_through_json(mapped_dataframe)) - - primitive = column_map.DataFrameColumnMapPrimitive( - # We have to make an instance of the primitive ourselves.
- hyperparams=hyperparams.defaults().replace({ - 'primitive': column_parser.ColumnParserPrimitive( - hyperparams=column_parser.ColumnParserPrimitive.metadata.get_hyperparams().defaults(), - ), - }), - ) - dataframe = primitive.produce(inputs=mapped_dataframe).value - - self.assertEqual(test_utils.convert_through_json(dataframe)[0][1][0], [0, 2.6173]) - - pickle.dumps(primitive) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_column_parser.py b/common-primitives/tests/test_column_parser.py deleted file mode 100644 index 5d4e4b6..0000000 --- a/common-primitives/tests/test_column_parser.py +++ /dev/null @@ -1,474 +0,0 @@ -import math -import os.path -import unittest - -import numpy - -from d3m import container, utils -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe, column_parser, utils as common_utils - -import utils as test_utils - - -class ColumnParserPrimitiveTestCase(unittest.TestCase): - def test_basic(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - - call_metadata = primitive.produce(inputs=dataset) - - dataframe = call_metadata.value - - hyperparams_class = column_parser.ColumnParserPrimitive.metadata.get_hyperparams() - - primitive = column_parser.ColumnParserPrimitive(hyperparams=hyperparams_class.defaults()) - - call_metadata = primitive.produce(inputs=dataframe) - - dataframe = call_metadata.value - - first_row = list(dataframe.itertuples(index=False, name=None))[0] - - self.assertEqual(first_row, (0, 5.1, 3.5, 1.4, 0.2, 6241605690342144121)) - - self.assertEqual([type(o) for o in first_row], [int, float, float, float, float, int]) - - self._test_basic_metadata(dataframe.metadata) - - def _test_basic_metadata(self, metadata): - self.maxDiff = None - - self.assertEqual(test_utils.convert_through_json(metadata.query(())), { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/Table', - ], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - } - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS,))), { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 6, - } - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, 0))), { - 'name': 'd3mIndex', - 'structural_type': 'int', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - }) - - for i in range(1, 5): - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, i))), { - 'name': ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth'][i - 1], - 'structural_type': 'float', - 'semantic_types': [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }, i) - - 
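One detail worth noting before the `species` assertion below: `test_basic`'s expected first row ends with `6241605690342144121` because the parser maps categorical string values onto deterministic 64-bit integers by hashing them, rather than enumerating the values it happens to see. The exact hash is internal to `ColumnParserPrimitive`, so the following is only a hypothetical sketch of the general technique (the function name and hash choice are illustrative):

```python
import hashlib

def stable_category_hash(value: str) -> int:
    # Illustrative only: derive a deterministic 64-bit integer from a category
    # string. The primitive's actual hash differs, so the concrete integers in
    # the tests (e.g. 6241605690342144121 for 'Iris-setosa') cannot be
    # reproduced with this sketch.
    digest = hashlib.sha256(value.encode('utf8')).digest()
    return int.from_bytes(digest[:8], 'little', signed=True)

assert stable_category_hash('Iris-setosa') == stable_category_hash('Iris-setosa')
```

Hashing keeps the mapping stateless: the same string yields the same integer in any split of the data, which is presumably why these tests can assert an exact value.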
self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, 5))), { - 'name': 'species', - 'structural_type': 'int', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }) - - def test_new(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - - call_metadata = primitive.produce(inputs=dataset) - - dataframe = call_metadata.value - - hyperparams_class = column_parser.ColumnParserPrimitive.metadata.get_hyperparams() - - primitive = column_parser.ColumnParserPrimitive(hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'use_columns': [2]})) - - call_metadata = primitive.produce(inputs=dataframe) - - dataframe = call_metadata.value - - first_row = list(dataframe.itertuples(index=False, name=None))[0] - - self.assertEqual(first_row, ('0', 3.5)) - - self.assertEqual([type(o) for o in first_row], [str, float]) - - self._test_new_metadata(dataframe.metadata) - - def _test_new_metadata(self, metadata): - self.maxDiff = None - - self.assertEqual(test_utils.convert_through_json(metadata.query(())), { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/Table', - ], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - } - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS,))), { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 2, - } - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, 0))), { - 'name': 'd3mIndex', - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, 1))), { - 'name': 'sepalWidth', - 'structural_type': 'float', - 'semantic_types': [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }) - - def test_append(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - - call_metadata = primitive.produce(inputs=dataset) - - dataframe = call_metadata.value - - hyperparams_class = column_parser.ColumnParserPrimitive.metadata.get_hyperparams() - - primitive = 
column_parser.ColumnParserPrimitive(hyperparams=hyperparams_class.defaults().replace({'return_result': 'append', 'replace_index_columns': False, 'parse_semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', 'http://schema.org/Integer']})) - - call_metadata = primitive.produce(inputs=dataframe) - - dataframe = call_metadata.value - - first_row = list(dataframe.itertuples(index=False, name=None))[0] - - self.assertEqual(first_row, ('0', '5.1', '3.5', '1.4', '0.2', 'Iris-setosa', 0, 6241605690342144121)) - - self.assertEqual([type(o) for o in first_row], [str, str, str, str, str, str, int, int]) - - self._test_append_metadata(dataframe.metadata, False) - - def test_append_replace_index_columns(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - - call_metadata = primitive.produce(inputs=dataset) - - dataframe = call_metadata.value - - hyperparams_class = column_parser.ColumnParserPrimitive.metadata.get_hyperparams() - - primitive = column_parser.ColumnParserPrimitive(hyperparams=hyperparams_class.defaults().replace({'return_result': 'append', 'parse_semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', 'http://schema.org/Integer']})) - - call_metadata = primitive.produce(inputs=dataframe) - - dataframe = call_metadata.value - - first_row = list(dataframe.itertuples(index=False, name=None))[0] - - self.assertEqual(first_row, (0, '5.1', '3.5', '1.4', '0.2', 'Iris-setosa', 6241605690342144121)) - - self.assertEqual([type(o) for o in first_row], [int, str, str, str, str, str, int]) - - self._test_append_metadata(dataframe.metadata, True) - - def _test_append_metadata(self, metadata, replace_index_columns): - self.maxDiff = None - - self.assertEqual(test_utils.convert_through_json(metadata.query(())), { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/Table', - ], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - } - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS,))), { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 7 if replace_index_columns else 8, - } - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, 0))), { - 'name': 'd3mIndex', - 'structural_type': 'int' if replace_index_columns else 'str', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - }) - - for i in range(1, 5): - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, i))), { - 'name': ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth'][i - 1], - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }, i) - - 
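Before the remaining per-column assertions, the difference between the two `append` tests above can be summarized with a toy pandas sketch (two columns only, column names taken from the iris tests; the real primitive additionally rewrites the D3M metadata, which is what `_test_append_metadata` checks):

```python
import pandas

# All columns start out as unparsed strings.
original = pandas.DataFrame({'d3mIndex': ['0'], 'species': ['Iris-setosa']})
parsed = pandas.DataFrame({'d3mIndex': [0], 'species': [6241605690342144121]})

# replace_index_columns=False: parsed columns are appended under their original
# names, so the frame ends up with both a str and an int 'd3mIndex' column.
appended = pandas.concat([original, parsed], axis=1)

# replace_index_columns=True (the default): the parsed primary index replaces
# the string one in place, and only non-index parsed columns are appended.
replaced = pandas.concat([original.assign(d3mIndex=parsed['d3mIndex']), parsed[['species']]], axis=1)

print(list(appended.columns))  # ['d3mIndex', 'species', 'd3mIndex', 'species']
print(list(replaced.columns))  # ['d3mIndex', 'species', 'species']
```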
self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, 5))), { - 'name': 'species', - 'structural_type': 'str', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }) - - if not replace_index_columns: - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, 6))), { - 'name': 'd3mIndex', - 'structural_type': 'int', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, 6 if replace_index_columns else 7))), { - 'name': 'species', - 'structural_type': 'int', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }) - - def test_integer(self): - hyperparams_class = column_parser.ColumnParserPrimitive.metadata.get_hyperparams() - - primitive = column_parser.ColumnParserPrimitive(hyperparams=hyperparams_class.defaults()) - - dataframe = container.DataFrame({'a': ['1.0', '2.0', '3.0']}, generate_metadata=True) - - dataframe.metadata = dataframe.metadata.update((metadata_base.ALL_ELEMENTS, 0), { - 'name': 'test', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - }) - - call_metadata = primitive.produce(inputs=dataframe) - - parsed_dataframe = call_metadata.value - - self.assertEqual(test_utils.convert_through_json(parsed_dataframe.metadata.query((metadata_base.ALL_ELEMENTS, 0))), { - 'name': 'test', - 'structural_type': 'int', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - }) - - self.assertEqual(list(parsed_dataframe.iloc[:, 0]), [1, 2, 3]) - - dataframe.iloc[2, 0] = '3.1' - - call_metadata = primitive.produce(inputs=dataframe) - - parsed_dataframe = call_metadata.value - - self.assertEqual(test_utils.convert_through_json(parsed_dataframe.metadata.query((metadata_base.ALL_ELEMENTS, 0))), { - 'name': 'test', - 'structural_type': 'int', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - }) - - self.assertEqual(list(parsed_dataframe.iloc[:, 0]), [1, 2, 3]) - - dataframe.iloc[2, 0] = 'aaa' - - with self.assertRaisesRegex(ValueError, 'Not all values in a column can be parsed into integers, but only integers were expected'): - primitive.produce(inputs=dataframe) - - dataframe.metadata = dataframe.metadata.update((metadata_base.ALL_ELEMENTS, 0), { - 'name': 'test', - 'structural_type': str, - 'semantic_types': [ - 'http://schema.org/Integer', - ], - }) - - call_metadata = primitive.produce(inputs=dataframe) - - parsed_dataframe = call_metadata.value - - self.assertEqual(test_utils.convert_through_json(parsed_dataframe.metadata.query((metadata_base.ALL_ELEMENTS, 0))), { - 'name': 'test', - 'structural_type': 'float', - 'semantic_types': [ - 'http://schema.org/Integer', - ], - }) - - self.assertEqual(list(parsed_dataframe.iloc[0:2, 0]), [1.0, 2.0]) - self.assertTrue(math.isnan(parsed_dataframe.iloc[2, 0])) - - def test_float_vector(self): - dataset_doc_path = 
os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'object_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults().replace({'dataframe_resource': 'learningData'})) - dataframe = primitive.produce(inputs=dataset).value - - hyperparams_class = column_parser.ColumnParserPrimitive.metadata.get_hyperparams() - primitive = column_parser.ColumnParserPrimitive(hyperparams=hyperparams_class.defaults()) - dataframe = primitive.produce(inputs=dataframe).value - - self.assertIsInstance(dataframe.iloc[0, 3], container.ndarray) - self.assertEqual(dataframe.iloc[0, 3].shape, (8,)) - - self.assertEqual(utils.to_json_structure(dataframe.metadata.to_internal_simple_structure()), [{ - 'selector': [], - 'metadata': { - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 4, - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json'}, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 4, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'structural_type': 'int', - 'name': 'd3mIndex', - 'semantic_types': ['http://schema.org/Integer', 'https://metadata.datadrivendiscovery.org/types/PrimaryMultiKey'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'name': 'image', - 'structural_type': 'str', - 'semantic_types': ['http://schema.org/Text', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'foreign_key': { - 'type': 'COLUMN', - 'resource_id': '0', - 'column_index': 0, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 2], - 'metadata': { - 'name': 'color_not_class', - 'structural_type': 'int', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 3], - 'metadata': { - 'structural_type': 'd3m.container.numpy.ndarray', - 'dimension': { - 'length': 8, - }, - 'name': 'bounding_polygon_area', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/FloatVector', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Boundary', - 'https://metadata.datadrivendiscovery.org/types/BoundingPolygon', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'boundary_for': { - 'resource_id': 'learningData', - 'column_name': 'image', - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 3, '__ALL_ELEMENTS__'], - 'metadata': {'structural_type': 'numpy.float64'}, - }]) - - def test_ugly_time_values(self): - for value in [ - 'Original chained constant price data are rescaled.', - '1986/87', - ]: - self.assertTrue(numpy.isnan(common_utils.parse_datetime_to_float(value)), value) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_compute_metafeatures.py 
b/common-primitives/tests/test_compute_metafeatures.py deleted file mode 100644 index 07a1e4c..0000000 --- a/common-primitives/tests/test_compute_metafeatures.py +++ /dev/null @@ -1,1106 +0,0 @@ -import math -import os -import os.path -import unittest - -import numpy - -from d3m import container -from d3m.metadata import base as metadata_base - -from common_primitives import column_parser, compute_metafeatures, dataset_to_dataframe, denormalize - -import utils as test_utils - - -def round_to_significant_digits(x, n): - if x == 0: - return x - elif not numpy.isfinite(x): - return x - else: - return round(x, -int(math.floor(math.log10(abs(x)))) + (n - 1)) - - -def round_numbers(obj): - if isinstance(obj, (int, str)): - return obj - elif isinstance(obj, float): - return round_to_significant_digits(obj, 12) - elif isinstance(obj, list): - return [round_numbers(el) for el in obj] - elif isinstance(obj, tuple): - return tuple(round_numbers(el) for el in obj) - elif isinstance(obj, dict): - return {k: round_numbers(v) for k, v in obj.items()} - else: - return obj - - -class ComputeMetafeaturesPrimitiveTestCase(unittest.TestCase): - def _get_iris(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - dataframe_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataframe_hyperparams_class.defaults()) - dataframe = dataframe_primitive.produce(inputs=dataset).value - - column_parser_hyperparams_class = column_parser.ColumnParserPrimitive.metadata.get_hyperparams() - column_parser_primitive = column_parser.ColumnParserPrimitive(hyperparams=column_parser_hyperparams_class.defaults()) - dataframe = column_parser_primitive.produce(inputs=dataframe).value - - return dataframe - - def _get_database(self, parse_categorical_columns): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. 
- dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - denormalize_hyperparams_class = denormalize.DenormalizePrimitive.metadata.get_hyperparams() - denormalize_primitive = denormalize.DenormalizePrimitive(hyperparams=denormalize_hyperparams_class.defaults()) - dataset = denormalize_primitive.produce(inputs=dataset).value - - dataframe_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataframe_hyperparams_class.defaults()) - dataframe = dataframe_primitive.produce(inputs=dataset).value - - if parse_categorical_columns: - parse_semantic_types = ( - 'http://schema.org/Boolean', 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'http://schema.org/Integer', 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/FloatVector', 'http://schema.org/DateTime', - ) - else: - parse_semantic_types = ( - 'http://schema.org/Boolean', - 'http://schema.org/Integer', 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/FloatVector', 'http://schema.org/DateTime', - ) - - column_parser_hyperparams_class = column_parser.ColumnParserPrimitive.metadata.get_hyperparams() - column_parser_primitive = column_parser.ColumnParserPrimitive(hyperparams=column_parser_hyperparams_class.defaults().replace({'parse_semantic_types': parse_semantic_types})) - dataframe = column_parser_primitive.produce(inputs=dataframe).value - - return dataframe - - def test_iris(self): - self.maxDiff = None - - dataframe = self._get_iris() - - hyperparams_class = compute_metafeatures.ComputeMetafeaturesPrimitive.metadata.get_hyperparams() - primitive = compute_metafeatures.ComputeMetafeaturesPrimitive(hyperparams=hyperparams_class.defaults()) - dataframe = primitive.produce(inputs=dataframe).value - - self.assertEqual(round_numbers(test_utils.convert_through_json(dataframe.metadata.query(())['data_metafeatures'])), round_numbers({ - 'attribute_counts_by_semantic_type': { - 'http://schema.org/Float': 4, - 'https://metadata.datadrivendiscovery.org/types/Attribute': 4, - }, - 'attribute_counts_by_structural_type': { - 'float': 4, - }, - 'attribute_ratios_by_semantic_type': { - 'http://schema.org/Float': 1.0, - 'https://metadata.datadrivendiscovery.org/types/Attribute': 1.0, - }, - 'attribute_ratios_by_structural_type': { - 'float': 1.0, - }, - 'dimensionality': 0.02666666666666667, - 'entropy_of_attributes': { - 'count': 4, - 'kurtosis': -1.4343159590314425, - 'max': 1.525353510619575, - 'mean': 1.4166844257365265, - 'median': 1.4323995290219738, - 'min': 1.2765851342825842, - 'quartile_1': 1.3565647450899858, - 'quartile_3': 1.4925192096685145, - 'skewness': -0.6047691718752254, - 'std': 0.11070539686522164, - }, - 'entropy_of_numeric_attributes': { - 'count': 4, - 'kurtosis': -1.4343159590314425, - 'max': 1.525353510619575, - 'mean': 1.4166844257365265, - 'median': 1.4323995290219738, - 'min': 1.2765851342825842, - 'quartile_1': 1.3565647450899858, - 'quartile_3': 1.4925192096685145, - 'skewness': 
-0.6047691718752254, - 'std': 0.11070539686522164, - }, - 'kurtosis_of_attributes': { - 'count': 4, - 'kurtosis': -1.1515850633224236, - 'max': 0.2907810623654279, - 'mean': -0.7507394876837397, - 'median': -0.9459091062274914, - 'min': -1.4019208006454036, - 'quartile_1': -1.3552958285158583, - 'quartile_3': -0.3413527653953726, - 'skewness': 0.8725328682893572, - 'std': 0.7948191385132984, - }, - 'mean_of_attributes': { - 'count': 4, - 'kurtosis': 0.8595879081956515, - 'max': 5.843333333333335, - 'mean': 3.4636666666666684, - 'median': 3.406333333333335, - 'min': 1.1986666666666672, - 'quartile_1': 2.5901666666666676, - 'quartile_3': 4.279833333333335, - 'skewness': 0.17098811780721151, - 'std': 1.919017997329383, - }, - 'number_distinct_values_of_numeric_attributes': { - 'count': 4, - 'kurtosis': -3.0617196548227046, - 'max': 43, - 'mean': 30.75, - 'median': 29.0, - 'min': 22, - 'quartile_1': 22.75, - 'quartile_3': 37.0, - 'skewness': 0.5076458131399395, - 'std': 10.07885575516057, - }, - 'number_of_attributes': 4, - 'number_of_binary_attributes': 0, - 'number_of_categorical_attributes': 0, - 'number_of_discrete_attributes': 0, - 'number_of_instances': 150, - 'number_of_instances_with_missing_values': 0, - 'number_of_instances_with_present_values': 150, - 'number_of_numeric_attributes': 4, - 'number_of_other_attributes': 0, - 'number_of_string_attributes': 0, - 'ratio_of_binary_attributes': 0.0, - 'ratio_of_categorical_attributes': 0.0, - 'ratio_of_discrete_attributes': 0.0, - 'ratio_of_instances_with_missing_values': 0.0, - 'ratio_of_instances_with_present_values': 1.0, - 'ratio_of_numeric_attributes': 1.0, - 'ratio_of_other_attributes': 0.0, - 'ratio_of_string_attributes': 0.0, - 'skew_of_attributes': { - 'count': 4, - 'kurtosis': -4.4981774675194846, - 'max': 0.3340526621720866, - 'mean': 0.06737570104778733, - 'median': 0.10495719724642275, - 'min': -0.27446425247378287, - 'quartile_1': -0.1473634847265412, - 'quartile_3': 0.3196963830207513, - 'skewness': -0.25709026597426626, - 'std': 0.3049355425307816, - }, - 'standard_deviation_of_attributes': { - 'count': 4, - 'kurtosis': 2.65240266862979, - 'max': 1.7644204199522617, - 'mean': 0.9473104002482848, - 'median': 0.7956134348393522, - 'min': 0.4335943113621737, - 'quartile_1': 0.6807691341161745, - 'quartile_3': 1.0621547009714627, - 'skewness': 1.4362343455338735, - 'std': 0.5714610798918619, - } - })) - self.assertFalse('data_metafeatures' in dataframe.metadata.query_column(0)) - self.assertEqual(round_numbers(test_utils.convert_through_json(dataframe.metadata.query_column(1)['data_metafeatures'])), round_numbers({ - 'entropy_of_values': 1.525353510619575, - 'number_distinct_values': 35, - 'number_of_missing_values': 0, - 'number_of_negative_numeric_values': 0, - 'number_of_numeric_values': 150, - 'number_of_numeric_values_equal_-1': 0, - 'number_of_numeric_values_equal_0': 0, - 'number_of_numeric_values_equal_1': 0, - 'number_of_positive_numeric_values': 150, - 'number_of_present_values': 150, - 'ratio_of_missing_values': 0.0, - 'ratio_of_negative_numeric_values': 0.0, - 'ratio_of_numeric_values': 1.0, - 'ratio_of_numeric_values_equal_-1': 0.0, - 'ratio_of_numeric_values_equal_0': 0.0, - 'ratio_of_numeric_values_equal_1': 0.0, - 'ratio_of_positive_numeric_values': 1.0, - 'ratio_of_present_values': 1.0, - 'value_counts_aggregate': { - 'count': 5, - 'kurtosis': -0.46949652355057747, - 'max': 42, - 'mean': 30.0, - 'median': 32.0, - 'min': 11, - 'quartile_1': 24.0, - 'quartile_3': 41.0, - 'skewness': -0.7773115383470599, - 'std': 
12.90348790056394, - }, - 'value_probabilities_aggregate': { - 'count': 5, - 'kurtosis': -0.4694965235505757, - 'max': 0.28, - 'mean': 0.2, - 'median': 0.21333333333333335, - 'min': 0.07333333333333333, - 'quartile_1': 0.16, - 'quartile_3': 0.2733333333333333, - 'skewness': -0.7773115383470603, - 'std': 0.08602325267042626, - }, - 'values_aggregate': { - 'count': 150, - 'kurtosis': -0.5520640413156395, - 'max': 7.9, - 'mean': 5.843333333333335, - 'median': 5.8, - 'min': 4.3, - 'quartile_1': 5.1, - 'quartile_3': 6.4, - 'skewness': 0.3149109566369728, - 'std': 0.8280661279778629, - }, - })) - self.assertEqual(round_numbers(test_utils.convert_through_json(dataframe.metadata.query_column(2)['data_metafeatures'])), round_numbers({ - 'entropy_of_values': 1.2765851342825842, - 'number_distinct_values': 23, - 'number_of_missing_values': 0, - 'number_of_negative_numeric_values': 0, - 'number_of_numeric_values': 150, - 'number_of_numeric_values_equal_-1': 0, - 'number_of_numeric_values_equal_0': 0, - 'number_of_numeric_values_equal_1': 0, - 'number_of_positive_numeric_values': 150, - 'number_of_present_values': 150, - 'ratio_of_missing_values': 0.0, - 'ratio_of_negative_numeric_values': 0.0, - 'ratio_of_numeric_values': 1.0, - 'ratio_of_numeric_values_equal_-1': 0.0, - 'ratio_of_numeric_values_equal_0': 0.0, - 'ratio_of_numeric_values_equal_1': 0.0, - 'ratio_of_positive_numeric_values': 1.0, - 'ratio_of_present_values': 1.0, - 'value_counts_aggregate': { - 'count': 5, - 'kurtosis': -0.9899064888741496, - 'max': 69, - 'mean': 30.0, - 'median': 20.0, - 'min': 4, - 'quartile_1': 11.0, - 'quartile_3': 46.0, - 'skewness': 0.8048211570183503, - 'std': 26.99073915253156, - }, - 'value_probabilities_aggregate': { - 'count': 5, - 'kurtosis': -0.9899064888741478, - 'max': 0.46, - 'mean': 0.19999999999999998, - 'median': 0.13333333333333333, - 'min': 0.02666666666666667, - 'quartile_1': 0.07333333333333333, - 'quartile_3': 0.30666666666666664, - 'skewness': 0.8048211570183509, - 'std': 0.17993826101687704, - }, - 'values_aggregate': { - 'count': 150, - 'kurtosis': 0.2907810623654279, - 'max': 4.4, - 'mean': 3.0540000000000007, - 'median': 3.0, - 'min': 2.0, - 'quartile_1': 2.8, - 'quartile_3': 3.3, - 'skewness': 0.3340526621720866, - 'std': 0.4335943113621737, - }, - })) - self.assertEqual(round_numbers(test_utils.convert_through_json(dataframe.metadata.query_column(3)['data_metafeatures'])), round_numbers({ - 'entropy_of_values': 1.38322461535912, - 'number_distinct_values': 43, - 'number_of_missing_values': 0, - 'number_of_negative_numeric_values': 0, - 'number_of_numeric_values': 150, - 'number_of_numeric_values_equal_-1': 0, - 'number_of_numeric_values_equal_0': 0, - 'number_of_numeric_values_equal_1': 1, - 'number_of_positive_numeric_values': 150, - 'number_of_present_values': 150, - 'ratio_of_missing_values': 0.0, - 'ratio_of_negative_numeric_values': 0.0, - 'ratio_of_numeric_values': 1.0, - 'ratio_of_numeric_values_equal_-1': 0.0, - 'ratio_of_numeric_values_equal_0': 0.0, - 'ratio_of_numeric_values_equal_1': 0.006666666666666667, - 'ratio_of_positive_numeric_values': 1.0, - 'ratio_of_present_values': 1.0, - 'value_counts_aggregate': { - 'count': 5, - 'kurtosis': -1.875313335089766, - 'max': 50, - 'mean': 30.0, - 'median': 34.0, - 'min': 3, - 'quartile_1': 16.0, - 'quartile_3': 47.0, - 'skewness': -0.4786622161186872, - 'std': 20.18662923818635, - }, - 'value_probabilities_aggregate': { - 'count': 5, - 'kurtosis': -1.8753133350897668, - 'max': 0.3333333333333333, - 'mean': 0.2, - 'median': 
0.22666666666666666, - 'min': 0.02, - 'quartile_1': 0.10666666666666667, - 'quartile_3': 0.31333333333333335, - 'skewness': -0.4786622161186876, - 'std': 0.13457752825457567, - }, - 'values_aggregate': { - 'count': 150, - 'kurtosis': -1.4019208006454036, - 'max': 6.9, - 'mean': 3.7586666666666693, - 'median': 4.35, - 'min': 1.0, - 'quartile_1': 1.6, - 'quartile_3': 5.1, - 'skewness': -0.27446425247378287, - 'std': 1.7644204199522617, - }, - })) - self.assertEqual(round_numbers(test_utils.convert_through_json(dataframe.metadata.query_column(4)['data_metafeatures'])), round_numbers({ - 'entropy_of_values': 1.4815744426848276, - 'number_distinct_values': 22, - 'number_of_missing_values': 0, - 'number_of_negative_numeric_values': 0, - 'number_of_numeric_values': 150, - 'number_of_numeric_values_equal_-1': 0, - 'number_of_numeric_values_equal_0': 0, - 'number_of_numeric_values_equal_1': 7, - 'number_of_positive_numeric_values': 150, - 'number_of_present_values': 150, - 'ratio_of_missing_values': 0.0, - 'ratio_of_negative_numeric_values': 0.0, - 'ratio_of_numeric_values': 1.0, - 'ratio_of_numeric_values_equal_-1': 0.0, - 'ratio_of_numeric_values_equal_0': 0.0, - 'ratio_of_numeric_values_equal_1': 0.04666666666666667, - 'ratio_of_positive_numeric_values': 1.0, - 'ratio_of_present_values': 1.0, - 'value_counts_aggregate': { - 'count': 5, - 'kurtosis': -0.6060977121954245, - 'max': 49, - 'mean': 30.0, - 'median': 29.0, - 'min': 8, - 'quartile_1': 23.0, - 'quartile_3': 41.0, - 'skewness': -0.28840734350346464, - 'std': 15.937377450509228, - }, - 'value_probabilities_aggregate': { - 'count': 5, - 'kurtosis': -0.606097712195421, - 'max': 0.32666666666666666, - 'mean': 0.2, - 'median': 0.19333333333333333, - 'min': 0.05333333333333334, - 'quartile_1': 0.15333333333333332, - 'quartile_3': 0.2733333333333333, - 'skewness': -0.2884073435034653, - 'std': 0.10624918300339484, - }, - 'values_aggregate': { - 'count': 150, - 'kurtosis': -1.3397541711393433, - 'max': 2.5, - 'mean': 1.1986666666666672, - 'median': 1.3, - 'min': 0.1, - 'quartile_1': 0.3, - 'quartile_3': 1.8, - 'skewness': -0.10499656214412734, - 'std': 0.7631607417008414, - }, - })) - self.assertEqual(round_numbers(test_utils.convert_through_json(dataframe.metadata.query_column(5)['data_metafeatures'])), round_numbers({ - 'default_accuracy': 0.3333333333333333, - 'entropy_of_values': 1.0986122886681096, - 'equivalent_number_of_numeric_attributes': 1.7538156960944151, - 'joint_entropy_of_attributes': { - 'count': 4, - 'kurtosis': -4.468260105522818, - 'max': 0.9180949375453917, - 'mean': 0.6264126219845205, - 'median': 0.6607409495199184, - 'min': 0.26607365135285327, - 'quartile_1': 0.3993550878466134, - 'quartile_3': 0.8877984836578254, - 'skewness': -0.24309705749856694, - 'std': 0.3221913428169348, - }, - 'joint_entropy_of_numeric_attributes': { - 'count': 4, - 'kurtosis': -5.533056612798099, - 'max': 2.1801835659431514, - 'mean': 1.8888840924201158, - 'median': 1.8856077827026931, - 'min': 1.604137238331926, - 'quartile_1': 1.6476031549386407, - 'quartile_3': 2.1268887201841684, - 'skewness': 0.01639056780792744, - 'std': 0.29770030633854977, - }, - 'mutual_information_of_numeric_attributes': { - 'count': 4, - 'kurtosis': -4.468260105522818, - 'max': 0.9180949375453917, - 'mean': 0.6264126219845205, - 'median': 0.6607409495199184, - 'min': 0.26607365135285327, - 'quartile_1': 0.3993550878466134, - 'quartile_3': 0.8877984836578254, - 'skewness': -0.24309705749856694, - 'std': 0.3221913428169348, - }, - 'number_distinct_values': 3, - 
'number_of_missing_values': 0, - 'number_of_present_values': 150, - 'numeric_noise_to_signal_ratio': 1.2615834611511623, - 'ratio_of_missing_values': 0.0, - 'ratio_of_present_values': 1.0, - 'value_counts_aggregate': { - 'count': 3, - 'max': 50, - 'mean': 50.0, - 'median': 50.0, - 'min': 50, - 'quartile_1': 50.0, - 'quartile_3': 50.0, - 'skewness': 0, - 'std': 0.0, - }, - 'value_probabilities_aggregate': { - 'count': 3, - 'max': 0.3333333333333333, - 'mean': 0.3333333333333333, - 'median': 0.3333333333333333, - 'min': 0.3333333333333333, - 'quartile_1': 0.3333333333333333, - 'quartile_3': 0.3333333333333333, - 'skewness': 0, - 'std': 0.0, - }, - })) - - def test_database_with_parsed_categorical_columns(self): - self.maxDiff = None - - dataframe = self._get_database(True) - - hyperparams_class = compute_metafeatures.ComputeMetafeaturesPrimitive.metadata.get_hyperparams() - primitive = compute_metafeatures.ComputeMetafeaturesPrimitive(hyperparams=hyperparams_class.defaults()) - dataframe = primitive.produce(inputs=dataframe).value - - self._test_database_metafeatures(dataframe.metadata, True) - - def test_database_without_parsed_categorical_columns(self): - self.maxDiff = None - - dataframe = self._get_database(False) - - hyperparams_class = compute_metafeatures.ComputeMetafeaturesPrimitive.metadata.get_hyperparams() - primitive = compute_metafeatures.ComputeMetafeaturesPrimitive(hyperparams=hyperparams_class.defaults()) - dataframe = primitive.produce(inputs=dataframe).value - - self._test_database_metafeatures(dataframe.metadata, False) - - def _test_database_metafeatures(self, metadata, parse_categorical_columns): - expected_metafeatures = { - 'attribute_counts_by_semantic_type': { - 'http://schema.org/DateTime': 1, - 'http://schema.org/Integer': 1, - 'http://schema.org/Text': 2, - 'https://metadata.datadrivendiscovery.org/types/Attribute': 6, - 'https://metadata.datadrivendiscovery.org/types/CategoricalData': 2, - }, - 'attribute_counts_by_structural_type': { - 'float': 2, - 'str': 4, - }, - 'attribute_ratios_by_semantic_type': { - 'http://schema.org/DateTime': 0.16666666666666666, - 'http://schema.org/Integer': 0.16666666666666666, - 'http://schema.org/Text': 0.3333333333333333, - 'https://metadata.datadrivendiscovery.org/types/Attribute': 1.0, - 'https://metadata.datadrivendiscovery.org/types/CategoricalData': 0.3333333333333333, - }, - 'attribute_ratios_by_structural_type': { - 'float': 0.3333333333333333, - 'str': 0.6666666666666666, - }, - 'dimensionality': 0.13333333333333333, - 'entropy_of_attributes': { - 'count': 4, - 'kurtosis': 1.5975414707531783, - 'max': 1.6094379124341005, - 'mean': 1.1249524175825663, - 'median': 1.0986122886681096, - 'min': 0.6931471805599453, - 'quartile_1': 0.9972460116410685, - 'quartile_3': 1.2263186946096072, - 'skewness': 0.4183300365459641, - 'std': 0.3753085673700856, - }, - 'entropy_of_categorical_attributes': { - 'count': 2, - 'max': 1.6094379124341005, - 'mean': 1.354025100551105, - 'median': 1.354025100551105, - 'min': 1.0986122886681096, - 'quartile_1': 1.2263186946096072, - 'quartile_3': 1.4817315064926029, - 'std': 0.3612082625687802, - }, - 'entropy_of_discrete_attributes': { - 'count': 2, - 'max': 1.0986122886681096, - 'mean': 0.8958797346140275, - 'median': 0.8958797346140275, - 'min': 0.6931471805599453, - 'quartile_1': 0.7945134575869863, - 'quartile_3': 0.9972460116410685, - 'std': 0.28670712747781957, - }, - 'entropy_of_numeric_attributes': { - 'count': 2, - 'max': 1.0986122886681096, - 'mean': 0.8958797346140275, - 'median': 
0.8958797346140275, - 'min': 0.6931471805599453, - 'quartile_1': 0.7945134575869863, - 'quartile_3': 0.9972460116410685, - 'std': 0.28670712747781957, - }, - 'kurtosis_of_attributes': { - 'count': 2, - 'max': -1.5348837209302326, - 'mean': -1.8415159345391905, - 'median': -1.8415159345391905, - 'min': -2.1481481481481484, - 'quartile_1': -1.9948320413436693, - 'quartile_3': -1.6881998277347114, - 'std': 0.4336434351462721, - }, - 'mean_of_attributes': { - 'count': 2, - 'max': 946713600.0, - 'mean': 473356800.75, - 'median': 473356800.75, - 'min': 1.5, - 'quartile_1': 236678401.125, - 'quartile_3': 710035200.375, - 'std': 669427605.3408685, - }, - 'number_distinct_values_of_categorical_attributes': { - 'count': 2, - 'max': 5, - 'mean': 4.0, - 'median': 4.0, - 'min': 3, - 'quartile_1': 3.5, - 'quartile_3': 4.5, - 'std': 1.4142135623730951, - }, - 'number_distinct_values_of_discrete_attributes': { - 'count': 2, - 'max': 3, - 'mean': 2.5, - 'median': 2.5, - 'min': 2, - 'quartile_1': 2.25, - 'quartile_3': 2.75, - 'std': 0.7071067811865476, - }, - 'number_distinct_values_of_numeric_attributes': { - 'count': 2, - 'max': 3, - 'mean': 2.5, - 'median': 2.5, - 'min': 2, - 'quartile_1': 2.25, - 'quartile_3': 2.75, - 'std': 0.7071067811865476, - }, - 'number_of_attributes': 6, - 'number_of_binary_attributes': 1, - 'number_of_categorical_attributes': 2, - 'number_of_discrete_attributes': 2, - 'number_of_instances': 45, - 'number_of_instances_with_missing_values': 15, - 'number_of_instances_with_present_values': 45, - 'number_of_numeric_attributes': 2, - 'number_of_other_attributes': 0, - 'number_of_string_attributes': 2, - 'ratio_of_binary_attributes': 0.16666666666666666, - 'ratio_of_categorical_attributes': 0.3333333333333333, - 'ratio_of_discrete_attributes': 0.3333333333333333, - 'ratio_of_instances_with_missing_values': 0.3333333333333333, - 'ratio_of_instances_with_present_values': 1.0, - 'ratio_of_numeric_attributes': 0.3333333333333333, - 'ratio_of_other_attributes': 0.0, - 'ratio_of_string_attributes': 0.3333333333333333, - 'skew_of_attributes': { - 'count': 2, - 'max': 0.00017349603091112943, - 'mean': 8.674801545556472e-05, - 'median': 8.674801545556472e-05, - 'min': 0.0, - 'quartile_1': 4.337400772778236e-05, - 'quartile_3': 0.00013012202318334707, - 'std': 0.0001226802199662105, - }, - 'standard_deviation_of_attributes': { - 'count': 2, - 'max': 260578306.67149138, - 'mean': 130289153.59001951, - 'median': 130289153.59001951, - 'min': 0.5085476277156078, - 'quartile_1': 65144577.049283564, - 'quartile_3': 195433730.13075545, - 'std': 184256687.31792185, - }, - } - - if parse_categorical_columns: - expected_metafeatures['attribute_counts_by_structural_type'] = { - 'float': 2, - 'int': 2, - 'str': 2, - } - expected_metafeatures['attribute_ratios_by_structural_type'] = { - 'float': 0.3333333333333333, - 'int': 0.3333333333333333, - 'str': 0.3333333333333333, - } - - self.assertEqual(round_numbers(test_utils.convert_through_json(metadata.query(())['data_metafeatures'])), round_numbers(expected_metafeatures)) - self.assertFalse('data_metafeatures' in metadata.query_column(0)) - - expected_metafeatures = { - 'entropy_of_values': 1.0986122886681096, - 'number_distinct_values': 3, - 'number_of_missing_values': 0, - 'number_of_present_values': 45, - 'ratio_of_missing_values': 0.0, - 'ratio_of_present_values': 1.0, - 'value_counts_aggregate': { - 'count': 3, - 'max': 15, - 'mean': 15.0, - 'median': 15.0, - 'min': 15, - 'quartile_1': 15.0, - 'quartile_3': 15.0, - 'skewness': 0, - 'std': 0.0, - }, - 
'value_probabilities_aggregate': { - 'count': 3, - 'max': 0.3333333333333333, - 'mean': 0.3333333333333333, - 'median': 0.3333333333333333, - 'min': 0.3333333333333333, - 'quartile_1': 0.3333333333333333, - 'quartile_3': 0.3333333333333333, - 'skewness': 0, - 'std': 0.0, - }, - } - - if parse_categorical_columns: - expected_metafeatures['values_aggregate'] = { - 'count': 45, - 'kurtosis': -1.5348837209302337, - 'max': 3183890296585507471, - 'mean': 1.3152606765673695e+18, - 'median': 5.866629697275507e+17, - 'min': 175228763389048878, - 'quartile_1': 1.7522876338904886e+17, - 'quartile_3': 3.1838902965855073e+18, - 'skewness': 0.679711376572956, - 'std': 1.3470047628846746e+18, - } - - self.assertEqual(round_numbers(test_utils.convert_through_json(metadata.query_column(1)['data_metafeatures'])), round_numbers(expected_metafeatures)) - self.assertEqual(round_numbers(test_utils.convert_through_json(metadata.query_column(2)['data_metafeatures'])), round_numbers({ - 'number_of_missing_values': 0, - 'number_of_present_values': 45, - 'ratio_of_missing_values': 0.0, - 'ratio_of_present_values': 1.0, - })) - self.assertEqual(round_numbers(test_utils.convert_through_json(metadata.query_column(3)['data_metafeatures'])), round_numbers({ - 'entropy_of_values': 0.6931471805599453, - 'number_distinct_values': 2, - 'number_of_missing_values': 15, - 'number_of_negative_numeric_values': 0, - 'number_of_numeric_values': 30, - 'number_of_numeric_values_equal_-1': 0, - 'number_of_numeric_values_equal_0': 0, - 'number_of_numeric_values_equal_1': 15, - 'number_of_positive_numeric_values': 30, - 'number_of_present_values': 30, - 'ratio_of_missing_values': 0.3333333333333333, - 'ratio_of_negative_numeric_values': 0.0, - 'ratio_of_numeric_values': 0.6666666666666666, - 'ratio_of_numeric_values_equal_-1': 0.0, - 'ratio_of_numeric_values_equal_0': 0.0, - 'ratio_of_numeric_values_equal_1': 0.3333333333333333, - 'ratio_of_positive_numeric_values': 0.6666666666666666, - 'ratio_of_present_values': 0.6666666666666666, - 'value_counts_aggregate': { - 'count': 2, - 'max': 15, - 'mean': 15.0, - 'median': 15.0, - 'min': 15, - 'quartile_1': 15.0, - 'quartile_3': 15.0, - 'std': 0.0, - }, - 'value_probabilities_aggregate': { - 'count': 2, - 'max': 0.5, - 'mean': 0.5, - 'median': 0.5, - 'min': 0.5, - 'quartile_1': 0.5, - 'quartile_3': 0.5, - 'std': 0.0, - }, - 'values_aggregate': { - 'count': 30, - 'kurtosis': -2.1481481481481484, - 'max': 2.0, - 'mean': 1.5, - 'median': 1.5, - 'min': 1.0, - 'quartile_1': 1.0, - 'quartile_3': 2.0, - 'skewness': 0.0, - 'std': 0.5085476277156078, - }, - })) - self.assertEqual(round_numbers(test_utils.convert_through_json(metadata.query_column(4)['data_metafeatures'])), round_numbers({ - 'number_of_missing_values': 0, - 'number_of_present_values': 45, - 'ratio_of_missing_values': 0.0, - 'ratio_of_present_values': 1.0, - })) - - expected_metafeatures = { - 'entropy_of_values': 1.6094379124341005, - 'number_distinct_values': 5, - 'number_of_missing_values': 0, - 'number_of_present_values': 45, - 'ratio_of_missing_values': 0.0, - 'ratio_of_present_values': 1.0, - 'value_counts_aggregate': { - 'count': 5, - 'kurtosis': 0, - 'max': 9, - 'mean': 9.0, - 'median': 9.0, - 'min': 9, - 'quartile_1': 9.0, - 'quartile_3': 9.0, - 'skewness': 0, - 'std': 0.0, - }, - 'value_probabilities_aggregate': { - 'count': 5, - 'kurtosis': 0, - 'max': 0.2, - 'mean': 0.2, - 'median': 0.2, - 'min': 0.2, - 'quartile_1': 0.2, - 'quartile_3': 0.2, - 'skewness': 0, - 'std': 0.0, - }, - } - - if parse_categorical_columns: - 
expected_metafeatures['values_aggregate'] = { - 'count': 45, - 'kurtosis': -0.8249445297886884, - 'max': 17926897368031380755, - 'mean': 1.1617029581691474e+19, - 'median': 1.1818891258207388e+19, - 'min': 4819821729471251610, - 'quartile_1': 9.804127312560234e+18, - 'quartile_3': 1.3715410240187093e+19, - 'skewness': -0.15176089654708094, - 'std': 4.378987201456074e+18, - } - - self.assertEqual(round_numbers(test_utils.convert_through_json(metadata.query_column(5)['data_metafeatures'])), round_numbers(expected_metafeatures)) - self.assertEqual(round_numbers(test_utils.convert_through_json(metadata.query_column(6)['data_metafeatures'])), round_numbers({ - 'entropy_of_values': 1.0986122886681096, - 'number_distinct_values': 3, - 'number_of_missing_values': 0, - 'number_of_negative_numeric_values': 0, - 'number_of_numeric_values': 45, - 'number_of_numeric_values_equal_-1': 0, - 'number_of_numeric_values_equal_0': 0, - 'number_of_numeric_values_equal_1': 0, - 'number_of_positive_numeric_values': 45, - 'number_of_present_values': 45, - 'ratio_of_missing_values': 0.0, - 'ratio_of_negative_numeric_values': 0.0, - 'ratio_of_numeric_values': 1.0, - 'ratio_of_numeric_values_equal_-1': 0.0, - 'ratio_of_numeric_values_equal_0': 0.0, - 'ratio_of_numeric_values_equal_1': 0.0, - 'ratio_of_positive_numeric_values': 1.0, - 'ratio_of_present_values': 1.0, - 'value_counts_aggregate': { - 'count': 3, - 'max': 15, - 'mean': 15.0, - 'median': 15.0, - 'min': 15, - 'quartile_1': 15.0, - 'quartile_3': 15.0, - 'skewness': 0, - 'std': 0.0, - }, - 'value_probabilities_aggregate': { - 'count': 3, - 'max': 0.3333333333333333, - 'mean': 0.3333333333333333, - 'median': 0.3333333333333333, - 'min': 0.3333333333333333, - 'quartile_1': 0.3333333333333333, - 'quartile_3': 0.3333333333333333, - 'skewness': 0, - 'std': 0.0, - }, - 'values_aggregate': { - 'count': 45, - 'kurtosis': -1.5348837209302326, - 'max': 1262304000.0, - 'mean': 946713600.0, - 'median': 946684800.0, - 'min': 631152000.0, - 'quartile_1': 631152000.0, - 'quartile_3': 1262304000.0, - 'skewness': 0.00017349603091112943, - 'std': 260578306.67149138, - }, - })) - - expected_metafeatures = { - 'categorical_noise_to_signal_ratio': 6.856024896846719, - 'discrete_noise_to_signal_ratio': 16.280596971377722, - 'entropy_of_values': 1.2922333886497557, - 'equivalent_number_of_attributes': 7.497510695804063, - 'equivalent_number_of_categorical_attributes': 7.497510695804063, - 'equivalent_number_of_discrete_attributes': 24.925850557201, - 'equivalent_number_of_numeric_attributes': 24.925850557201, - 'joint_entropy_of_attributes': { - 'count': 4, - 'kurtosis': 3.8310594212937232, - 'max': 0.27405736318703244, - 'mean': 0.11209904602421886, - 'median': 0.06401513288957879, - 'min': 0.04630855513068542, - 'quartile_1': 0.05461037397689525, - 'quartile_3': 0.12150380493690241, - 'skewness': 1.949786087429789, - 'std': 0.10842988984399864, - }, - 'joint_entropy_of_categorical_attributes': { - 'count': 2, - 'max': 2.6276139378968235, - 'mean': 2.473903498180581, - 'median': 2.473903498180581, - 'min': 2.3201930584643393, - 'quartile_1': 2.3970482783224605, - 'quartile_3': 2.5507587180387024, - 'std': 0.2173793885250416, - }, - 'joint_entropy_of_discrete_attributes': { - 'count': 2, - 'max': 2.3334680303922335, - 'mean': 2.139600733638498, - 'median': 2.139600733638498, - 'min': 1.945733436884763, - 'quartile_1': 2.0426670852616304, - 'quartile_3': 2.236534382015366, - 'std': 0.2741697603697419, - }, - 'joint_entropy_of_numeric_attributes': { - 'count': 2, - 'max': 
2.3334680303922335, - 'mean': 2.139600733638498, - 'median': 2.139600733638498, - 'min': 1.945733436884763, - 'quartile_1': 2.0426670852616304, - 'quartile_3': 2.236534382015366, - 'std': 0.2741697603697419, - }, - 'mutual_information_of_attributes': { - 'count': 2, - 'max': 0.27405736318703244, - 'mean': 0.17235499102027907, - 'median': 0.17235499102027907, - 'min': 0.07065261885352572, - 'quartile_1': 0.12150380493690241, - 'quartile_3': 0.22320617710365576, - 'std': 0.1438288740437386, - }, - 'mutual_information_of_categorical_attributes': { - 'count': 2, - 'max': 0.27405736318703244, - 'mean': 0.17235499102027907, - 'median': 0.17235499102027907, - 'min': 0.07065261885352572, - 'quartile_1': 0.12150380493690241, - 'quartile_3': 0.22320617710365576, - 'std': 0.1438288740437386, - }, - 'mutual_information_of_discrete_attributes': { - 'count': 2, - 'max': 0.05737764692563185, - 'mean': 0.05184310102815864, - 'median': 0.05184310102815864, - 'min': 0.04630855513068542, - 'quartile_1': 0.049075828079422026, - 'quartile_3': 0.05461037397689525, - 'std': 0.007827029869782995, - }, - 'mutual_information_of_numeric_attributes': { - 'count': 2, - 'max': 0.05737764692563185, - 'mean': 0.05184310102815864, - 'median': 0.05184310102815864, - 'min': 0.04630855513068542, - 'quartile_1': 0.049075828079422026, - 'quartile_3': 0.05461037397689525, - 'std': 0.007827029869782995, - }, - 'noise_to_signal_ratio': 5.526950051885679, - 'number_distinct_values': 45, - 'number_of_missing_values': 0, - 'number_of_negative_numeric_values': 0, - 'number_of_numeric_values': 45, - 'number_of_numeric_values_equal_-1': 0, - 'number_of_numeric_values_equal_0': 0, - 'number_of_numeric_values_equal_1': 0, - 'number_of_positive_numeric_values': 45, - 'number_of_present_values': 45, - 'numeric_noise_to_signal_ratio': 16.280596971377722, - 'ratio_of_missing_values': 0.0, - 'ratio_of_negative_numeric_values': 0.0, - 'ratio_of_numeric_values': 1.0, - 'ratio_of_numeric_values_equal_-1': 0.0, - 'ratio_of_numeric_values_equal_0': 0.0, - 'ratio_of_numeric_values_equal_1': 0.0, - 'ratio_of_positive_numeric_values': 1.0, - 'ratio_of_present_values': 1.0, - 'value_counts_aggregate': { - 'count': 4, - 'kurtosis': 0.2795705816375573, - 'max': 19, - 'mean': 11.25, - 'median': 10.0, - 'min': 6, - 'quartile_1': 7.5, - 'quartile_3': 13.75, - 'skewness': 1.0126926768695854, - 'std': 5.737304826019502, - }, - 'value_probabilities_aggregate': { - 'count': 4, - 'kurtosis': 0.2795705816375609, - 'max': 0.4222222222222222, - 'mean': 0.25, - 'median': 0.2222222222222222, - 'min': 0.13333333333333333, - 'quartile_1': 0.16666666666666666, - 'quartile_3': 0.3055555555555556, - 'skewness': 1.0126926768695859, - 'std': 0.12749566280043337, - }, - 'values_aggregate': { - 'count': 45, - 'kurtosis': -1.376558337329924, - 'max': 70.8170731707317, - 'mean': 54.363425575007106, - 'median': 53.6699876392329, - 'min': 32.328512195122, - 'quartile_1': 45.648691933945, - 'quartile_3': 65.5693658536586, - 'skewness': -0.11742803570367141, - 'std': 11.607381033992365, - }, - } - - if parse_categorical_columns: - # Because the order of string values is different from the order of encoded values, - # the numbers are slightly different between parsed and not parsed cases. 
- expected_metafeatures['joint_entropy_of_categorical_attributes'] = { - 'count': 2, - 'max': 2.6276139378968226, - 'mean': 2.473903498180581, - 'median': 2.473903498180581, - 'min': 2.3201930584643393, - 'quartile_1': 2.39704827832246, - 'quartile_3': 2.550758718038702, - 'std': 0.217379388525041, - } - expected_metafeatures['joint_entropy_of_attributes'] = { - 'count': 4, - 'kurtosis': 3.8310594212937232, - 'max': 0.27405736318703244, - 'mean': 0.11209904602421886, - 'median': 0.06401513288957879, - 'min': 0.04630855513068542, - 'quartile_1': 0.05461037397689525, - 'quartile_3': 0.12150380493690241, - 'skewness': 1.949786087429789, - 'std': 0.10842988984399864, - } - - self.assertEqual(round_numbers(test_utils.convert_through_json(metadata.query_column(7)['data_metafeatures'])), round_numbers(expected_metafeatures)) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_construct_predictions.py b/common-primitives/tests/test_construct_predictions.py deleted file mode 100644 index 531d711..0000000 --- a/common-primitives/tests/test_construct_predictions.py +++ /dev/null @@ -1,233 +0,0 @@ -import copy -import os -import unittest - -import numpy - -from d3m import container -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe, construct_predictions, extract_columns_semantic_types - -import utils as test_utils - - -class ConstructPredictionsPrimitiveTestCase(unittest.TestCase): - # TODO: Make this part of metadata API. - # Something like setting a semantic type for given columns. - def _mark_all_targets(self, dataset, targets): - for target in targets: - dataset.metadata = dataset.metadata.add_semantic_type((target['resource_id'], metadata_base.ALL_ELEMENTS, target['column_index']), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type((target['resource_id'], metadata_base.ALL_ELEMENTS, target['column_index']), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type((target['resource_id'], metadata_base.ALL_ELEMENTS, target['column_index']), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - def _get_iris_dataframe(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - self._mark_all_targets(dataset, [{'resource_id': 'learningData', 'column_index': 5}]) - - hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - - call_metadata = primitive.produce(inputs=dataset) - - dataframe = call_metadata.value - - return dataframe - - def test_correct_order(self): - dataframe = self._get_iris_dataframe() - - hyperparams_class = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive.metadata.get_hyperparams() - - # We extract both the primary index and targets. So it is in the output format already. 
- primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive(hyperparams=hyperparams_class.defaults().replace({'semantic_types': ('https://metadata.datadrivendiscovery.org/types/PrimaryKey', 'https://metadata.datadrivendiscovery.org/types/Target',)})) - - call_metadata = primitive.produce(inputs=dataframe) - - targets = call_metadata.value - - # We pretend these are our predictions. - targets.metadata = targets.metadata.remove_semantic_type((metadata_base.ALL_ELEMENTS, 1), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - targets.metadata = targets.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 1), 'https://metadata.datadrivendiscovery.org/types/PredictedTarget') - - # We switch columns around. - targets = targets.select_columns([1, 0]) - - hyperparams_class = construct_predictions.ConstructPredictionsPrimitive.metadata.get_hyperparams() - - construct_primitive = construct_predictions.ConstructPredictionsPrimitive(hyperparams=hyperparams_class.defaults()) - - call_metadata = construct_primitive.produce(inputs=targets, reference=dataframe) - - dataframe = call_metadata.value - - self.assertEqual(list(dataframe.columns), ['d3mIndex', 'species']) - - self._test_metadata(dataframe.metadata) - - def test_all_columns(self): - dataframe = self._get_iris_dataframe() - - # We use all columns. Output has to be just index and targets. - targets = copy.copy(dataframe) - - # We pretend these are our predictions. - targets.metadata = targets.metadata.remove_semantic_type((metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - targets.metadata = targets.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/PredictedTarget') - - hyperparams_class = construct_predictions.ConstructPredictionsPrimitive.metadata.get_hyperparams() - - construct_primitive = construct_predictions.ConstructPredictionsPrimitive(hyperparams=hyperparams_class.defaults()) - - call_metadata = construct_primitive.produce(inputs=targets, reference=dataframe) - - dataframe = call_metadata.value - - self.assertEqual(list(dataframe.columns), ['d3mIndex', 'species']) - - self._test_metadata(dataframe.metadata) - - def test_missing_index(self): - dataframe = self._get_iris_dataframe() - - # We just use all columns. - targets = copy.copy(dataframe) - - # We pretend these are our predictions. - targets.metadata = targets.metadata.remove_semantic_type((metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - targets.metadata = targets.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/PredictedTarget') - - # Remove primary index. This one has to be reconstructed. - targets = targets.remove_columns([0]) - - hyperparams_class = construct_predictions.ConstructPredictionsPrimitive.metadata.get_hyperparams() - - construct_primitive = construct_predictions.ConstructPredictionsPrimitive(hyperparams=hyperparams_class.defaults()) - - call_metadata = construct_primitive.produce(inputs=targets, reference=dataframe) - - dataframe = call_metadata.value - - self.assertEqual(list(dataframe.columns), ['d3mIndex', 'species']) - - self._test_metadata(dataframe.metadata) - - def test_just_targets_no_metadata(self): - dataframe = self._get_iris_dataframe() - - hyperparams_class = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive.metadata.get_hyperparams() - - # We extract just targets. 
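-        # (All metadata is then wiped and regenerated below, forcing
-        # ConstructPredictionsPrimitive to recover the d3mIndex column and the
-        # column names from the reference dataframe alone.)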
- primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive(hyperparams=hyperparams_class.defaults().replace({'semantic_types': ('https://metadata.datadrivendiscovery.org/types/Target',)})) - - call_metadata = primitive.produce(inputs=dataframe) - - targets = call_metadata.value - - # Remove all metadata. - targets.metadata = metadata_base.DataMetadata().generate(targets) - - hyperparams_class = construct_predictions.ConstructPredictionsPrimitive.metadata.get_hyperparams() - - construct_primitive = construct_predictions.ConstructPredictionsPrimitive(hyperparams=hyperparams_class.defaults()) - - call_metadata = construct_primitive.produce(inputs=targets, reference=dataframe) - - dataframe = call_metadata.value - - self.assertEqual(list(dataframe.columns), ['d3mIndex', 'species']) - - self._test_metadata(dataframe.metadata, True) - - def _test_metadata(self, metadata, no_metadata=False): - self.maxDiff = None - - self.assertEqual(test_utils.convert_through_json(metadata.query(())), { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/Table', - ], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - } - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS,))), { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 2, - } - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, 0))), { - 'name': 'd3mIndex', - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - }) - - if no_metadata: - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, 1))), { - 'name': 'species', - 'structural_type': 'str', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget', - ], - }) - - else: - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, 1))), { - 'name': 'species', - 'structural_type': 'str', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget', - ], - }) - - def test_float_vector(self): - dataframe = container.DataFrame({ - 'd3mIndex': [0], - 'target': [container.ndarray(numpy.array([3,5,9,10]))], - }, generate_metadata=True) - - # Update metadata. 
- dataframe.metadata = dataframe.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 0), 'https://metadata.datadrivendiscovery.org/types/PrimaryKey') - dataframe.metadata = dataframe.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 1), 'https://metadata.datadrivendiscovery.org/types/PredictedTarget') - - hyperparams_class = construct_predictions.ConstructPredictionsPrimitive.metadata.get_hyperparams() - - construct_primitive = construct_predictions.ConstructPredictionsPrimitive(hyperparams=hyperparams_class.defaults()) - - dataframe = construct_primitive.produce(inputs=dataframe, reference=dataframe).value - - self.assertEqual(list(dataframe.columns), ['d3mIndex', 'target']) - - self.assertEqual(dataframe.values.tolist(), [ - [0, '3,5,9,10'], - ]) - - self.assertEqual(dataframe.metadata.query_column(1), { - 'structural_type': str, - 'name': 'target', - 'semantic_types': ( - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget', - ), - }) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_csv_reader.py b/common-primitives/tests/test_csv_reader.py deleted file mode 100644 index 3430c33..0000000 --- a/common-primitives/tests/test_csv_reader.py +++ /dev/null @@ -1,50 +0,0 @@ -import unittest -import os - -from d3m import container - -from common_primitives import dataset_to_dataframe, csv_reader - - -class CSVReaderPrimitiveTestCase(unittest.TestCase): - def test_basic(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'timeseries_dataset_2', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - dataframe_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataframe_hyperparams_class.defaults().replace({'dataframe_resource': '0'})) - dataframe = dataframe_primitive.produce(inputs=dataset).value - - csv_hyperparams_class = csv_reader.CSVReaderPrimitive.metadata.get_hyperparams() - csv_primitive = csv_reader.CSVReaderPrimitive(hyperparams=csv_hyperparams_class.defaults().replace({'return_result': 'replace'})) - tables = csv_primitive.produce(inputs=dataframe).value - - self.assertEqual(tables.shape, (5, 1)) - - self._test_metadata(tables.metadata) - - def _test_metadata(self, metadata): - self.assertEqual(metadata.query_column(0)['structural_type'], container.DataFrame) - self.assertEqual(metadata.query_column(0)['semantic_types'], ('https://metadata.datadrivendiscovery.org/types/PrimaryKey', 'https://metadata.datadrivendiscovery.org/types/Timeseries', 'https://metadata.datadrivendiscovery.org/types/Table')) - - self.assertEqual(metadata.query_column(0, at=(0, 0)), { - 'structural_type': str, - 'name': 'time', - 'semantic_types': ( - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/Time', - ) - }) - self.assertEqual(metadata.query_column(1, at=(0, 0)), { - 'structural_type': str, - 'name': 'value', - 'semantic_types': ( - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ) - }) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_cut_audio.py b/common-primitives/tests/test_cut_audio.py deleted file mode 100644 index da8282a..0000000 --- a/common-primitives/tests/test_cut_audio.py +++ /dev/null @@ -1,122 +0,0 @@ -import unittest -import os - -from d3m import container - 
-from common_primitives import audio_reader, cut_audio, dataset_to_dataframe, denormalize, column_parser - - -class CutAudioPrimitiveTestCase(unittest.TestCase): - def test_basic(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'audio_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - denormalize_hyperparams_class = denormalize.DenormalizePrimitive.metadata.get_hyperparams() - denormalize_primitive = denormalize.DenormalizePrimitive(hyperparams=denormalize_hyperparams_class.defaults()) - dataset = denormalize_primitive.produce(inputs=dataset).value - - dataframe_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataframe_hyperparams_class.defaults()) - dataframe = dataframe_primitive.produce(inputs=dataset).value - - column_parser_hyperparams_class = column_parser.ColumnParserPrimitive.metadata.get_hyperparams() - column_parser_primitive = column_parser.ColumnParserPrimitive(hyperparams=column_parser_hyperparams_class.defaults()) - dataframe = column_parser_primitive.produce(inputs=dataframe).value - - audio_hyperparams_class = audio_reader.AudioReaderPrimitive.metadata.get_hyperparams() - audio_primitive = audio_reader.AudioReaderPrimitive(hyperparams=audio_hyperparams_class.defaults()) - dataframe = audio_primitive.produce(inputs=dataframe).value - - self.assertEqual(dataframe.iloc[0, 1], 'test_audio.mp3') - self.assertEqual(dataframe.iloc[0, 5].shape, (4410, 1)) - - cut_audio_hyperparams_class = cut_audio.CutAudioPrimitive.metadata.get_hyperparams() - cut_audio_primitive = cut_audio.CutAudioPrimitive(hyperparams=cut_audio_hyperparams_class.defaults()) - dataframe = cut_audio_primitive.produce(inputs=dataframe).value - - self.assertEqual(dataframe.iloc[0, 1], 'test_audio.mp3') - self.assertEqual(dataframe.iloc[0, 5].shape, (44, 1)) - - self._test_metadata(dataframe.metadata, False) - - def _test_metadata(self, dataframe_metadata, is_can_accept): - self.assertEqual(dataframe_metadata.query_column(2), { - 'structural_type': float, - 'name': 'start', - 'semantic_types': ( - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Boundary', - 'https://metadata.datadrivendiscovery.org/types/IntervalStart', - ), - }) - self.assertEqual(dataframe_metadata.query_column(3), { - 'structural_type': float, - 'name': 'end', - 'semantic_types': ( - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Boundary', - 'https://metadata.datadrivendiscovery.org/types/IntervalEnd', - ), - }) - - if is_can_accept: - self.assertEqual(dataframe_metadata.query_column(5), { - 'structural_type': container.ndarray, - 'semantic_types': ( - 'http://schema.org/AudioObject', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - 'https://metadata.datadrivendiscovery.org/types/UniqueKey', - ), - 'name': 'filename', - }) - self.assertEqual(dataframe_metadata.query((0, 5)), { - 'structural_type': container.ndarray, - 'semantic_types': ( - 'http://schema.org/AudioObject', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - 'https://metadata.datadrivendiscovery.org/types/UniqueKey', - ), - 'name': 'filename', - }) - else: - self.assertEqual(dataframe_metadata.query_column(5), { - 'structural_type': container.ndarray, - 'semantic_types': ( - 'http://schema.org/AudioObject', -
'https://metadata.datadrivendiscovery.org/types/Attribute', - 'https://metadata.datadrivendiscovery.org/types/UniqueKey', - 'https://metadata.datadrivendiscovery.org/types/Table', - ), - 'dimension': { - # The length is set here only because there is only one row. - 'length': 44, - 'name': 'rows', - 'semantic_types': ( - 'https://metadata.datadrivendiscovery.org/types/TabularRow', - ), - }, - 'name': 'filename', - }) - self.assertEqual(dataframe_metadata.query((0, 5)), { - 'structural_type': container.ndarray, - 'semantic_types': ( - 'http://schema.org/AudioObject', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - 'https://metadata.datadrivendiscovery.org/types/UniqueKey', - 'https://metadata.datadrivendiscovery.org/types/Table', - ), - 'dimension': { - 'length': 44, - 'name': 'rows', - 'semantic_types': ( - 'https://metadata.datadrivendiscovery.org/types/TabularRow', - ), - 'sampling_rate': 44100, - }, - 'name': 'filename', - }) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_dataframe_flatten.py b/common-primitives/tests/test_dataframe_flatten.py deleted file mode 100644 index 7554132..0000000 --- a/common-primitives/tests/test_dataframe_flatten.py +++ /dev/null @@ -1,132 +0,0 @@ -import unittest -import os - -from d3m import container -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe, csv_reader, dataframe_flatten - - -class DataFrameFlattenPrimitiveTestCase(unittest.TestCase): - - COLUMN_METADATA = { - 'time': { - 'structural_type': str, - 'name': 'time', - 'semantic_types': ( - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/Time' - ), - }, - 'value': { - 'structural_type': str, - 'name': 'value', - 'semantic_types': ( - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute' - ), - } - } - - def test_replace(self) -> None: - tables = self._load_data() - flat_hyperparams_class = dataframe_flatten.DataFrameFlattenPrimitive.metadata.get_hyperparams() - flat_primitive = dataframe_flatten.DataFrameFlattenPrimitive(hyperparams=flat_hyperparams_class.defaults()) - flat_result = flat_primitive.produce(inputs=tables).value - - self.assertEqual(flat_result.shape, (830, 3)) - - metadata = flat_result.metadata - self._check_filename_metadata(metadata, 0) - self.assertEqual(metadata.query_column(1), self.COLUMN_METADATA['time']) - self.assertEqual(metadata.query_column(2), self.COLUMN_METADATA['value']) - - def test_new(self) -> None: - tables = self._load_data() - - flat_hyperparams_class = dataframe_flatten.DataFrameFlattenPrimitive.metadata.get_hyperparams() - hp = flat_hyperparams_class.defaults().replace({ - 'return_result': 'new', - 'add_index_columns': False - }) - flat_primitive = dataframe_flatten.DataFrameFlattenPrimitive(hyperparams=hp) - flat_result = flat_primitive.produce(inputs=tables).value - - self.assertEqual(flat_result.shape, (830, 2)) - metadata = flat_result.metadata - self.assertEqual(metadata.query_column(0), self.COLUMN_METADATA['time']) - self.assertEqual(metadata.query_column(1), self.COLUMN_METADATA['value']) - - def test_add_index_columns(self) -> None: - tables = self._load_data() - - flat_hyperparams_class = dataframe_flatten.DataFrameFlattenPrimitive.metadata.get_hyperparams() - hp = flat_hyperparams_class.defaults().replace({ - 'return_result': 'new', - 'add_index_columns': True - }) - flat_primitive = dataframe_flatten.DataFrameFlattenPrimitive(hyperparams=hp) - flat_result = 
flat_primitive.produce(inputs=tables).value - - self.assertEqual(flat_result.shape, (830, 3)) - metadata = flat_result.metadata - self._check_filename_metadata(metadata, 0) - self.assertEqual(metadata.query_column(1), self.COLUMN_METADATA['time']) - self.assertEqual(metadata.query_column(2), self.COLUMN_METADATA['value']) - - def test_use_columns(self) -> None: - tables = self._load_data() - - flat_hyperparams_class = dataframe_flatten.DataFrameFlattenPrimitive.metadata.get_hyperparams() - hp = flat_hyperparams_class.defaults().replace({'use_columns': [1]}) - - flat_primitive = dataframe_flatten.DataFrameFlattenPrimitive(hyperparams=hp) - flat_result = flat_primitive.produce(inputs=tables).value - - self.assertEqual(flat_result.shape, (830, 3)) - - metadata = flat_result.metadata - self._check_filename_metadata(metadata, 0) - self.assertEqual(metadata.query_column(1), self.COLUMN_METADATA['time']) - self.assertEqual(metadata.query_column(2), self.COLUMN_METADATA['value']) - - def test_exclude_columns(self) -> None: - tables = self._load_data() - - flat_hyperparams_class = dataframe_flatten.DataFrameFlattenPrimitive.metadata.get_hyperparams() - hp = flat_hyperparams_class.defaults().replace({'exclude_columns': [0]}) - - flat_primitive = dataframe_flatten.DataFrameFlattenPrimitive(hyperparams=hp) - flat_result = flat_primitive.produce(inputs=tables).value - - self.assertEqual(flat_result.shape, (830, 3)) - - metadata = flat_result.metadata - self._check_filename_metadata(metadata, 0) - self.assertEqual(metadata.query_column(1), self.COLUMN_METADATA['time']) - self.assertEqual(metadata.query_column(2), self.COLUMN_METADATA['value']) - - def _load_data(self) -> container.DataFrame: - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'timeseries_dataset_2', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - dataframe_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataframe_hyperparams_class.defaults().replace({'dataframe_resource': '0'})) - dataframe = dataframe_primitive.produce(inputs=dataset).value - - csv_hyperparams_class = csv_reader.CSVReaderPrimitive.metadata.get_hyperparams() - csv_primitive = csv_reader.CSVReaderPrimitive(hyperparams=csv_hyperparams_class.defaults().replace({'return_result': 'append'})) - return csv_primitive.produce(inputs=dataframe).value - - def _check_filename_metadata(self, metadata: metadata_base.Metadata, col_num: int) -> None: - self.assertEqual(metadata.query_column(col_num)['name'], 'filename') - self.assertEqual(metadata.query_column(col_num)['structural_type'], str) - self.assertEqual(metadata.query_column(col_num)['semantic_types'], ( - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - 'https://metadata.datadrivendiscovery.org/types/FileName', - 'https://metadata.datadrivendiscovery.org/types/Timeseries')) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_dataframe_image_reader.py b/common-primitives/tests/test_dataframe_image_reader.py deleted file mode 100644 index 7368997..0000000 --- a/common-primitives/tests/test_dataframe_image_reader.py +++ /dev/null @@ -1,46 +0,0 @@ -import unittest -import os - -from d3m import container - -from common_primitives import dataset_to_dataframe, dataframe_image_reader - - -class
DataFrameImageReaderPrimitiveTestCase(unittest.TestCase): - def test_basic(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'image_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - dataframe_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataframe_hyperparams_class.defaults().replace({'dataframe_resource': '0'})) - dataframe = dataframe_primitive.produce(inputs=dataset).value - - image_hyperparams_class = dataframe_image_reader.DataFrameImageReaderPrimitive.metadata.get_hyperparams() - image_primitive = dataframe_image_reader.DataFrameImageReaderPrimitive(hyperparams=image_hyperparams_class.defaults().replace({'return_result': 'replace'})) - images = image_primitive.produce(inputs=dataframe).value - - self.assertEqual(images.shape, (5, 1)) - self.assertEqual(images.iloc[0, 0].shape, (225, 150, 3)) - self.assertEqual(images.iloc[1, 0].shape, (32, 32, 3)) - self.assertEqual(images.iloc[2, 0].shape, (32, 32, 3)) - self.assertEqual(images.iloc[3, 0].shape, (28, 28, 1)) - self.assertEqual(images.iloc[4, 0].shape, (28, 28, 1)) - - self._test_metadata(images.metadata) - - self.assertEqual(images.metadata.query((0, 0))['image_reader_metadata'], { - 'jfif': 257, - 'jfif_version': (1, 1), - 'dpi': (96, 96), - 'jfif_unit': 1, - 'jfif_density': (96, 96), - }) - - def _test_metadata(self, metadata): - self.assertEqual(metadata.query_column(0)['structural_type'], container.ndarray) - self.assertEqual(metadata.query_column(0)['semantic_types'], ('https://metadata.datadrivendiscovery.org/types/PrimaryKey', 'http://schema.org/ImageObject')) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_dataframe_to_list.py b/common-primitives/tests/test_dataframe_to_list.py deleted file mode 100644 index 512396c..0000000 --- a/common-primitives/tests/test_dataframe_to_list.py +++ /dev/null @@ -1,41 +0,0 @@ -import unittest - -from d3m import container - -from common_primitives import dataframe_to_list, dataset_to_dataframe - -import utils as test_utils - - -class DataFrameToListPrimitiveTestCase(unittest.TestCase): - def test_basic(self): - # load the iris dataset - dataset = test_utils.load_iris_metadata() - - # convert the dataset into a dataframe - dataset_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataset_hyperparams_class.defaults()) - dataframe = dataframe_primitive.produce(inputs=dataset).value - - # convert the dataframe into a list - list_hyperparams_class = dataframe_to_list.DataFrameToListPrimitive.metadata.get_hyperparams() - list_primitive = dataframe_to_list.DataFrameToListPrimitive(hyperparams=list_hyperparams_class.defaults()) - list_value = list_primitive.produce(inputs=dataframe).value - - self.assertIsInstance(list_value, container.List) - - # verify dimensions - self.assertEqual(len(list_value), 150) - self.assertEqual(len(list_value[0]), 6) - - # verify data type is unchanged - for row in list_value: - for val in row: - self.assertIsInstance(val, str) - - # validate metadata - test_utils.test_iris_metadata(self, list_value.metadata, 'd3m.container.list.List', 'd3m.container.list.List') - - -if __name__ == '__main__': - unittest.main() diff 
--git a/common-primitives/tests/test_dataframe_to_ndarray.py b/common-primitives/tests/test_dataframe_to_ndarray.py deleted file mode 100644 index 6e79645..0000000 --- a/common-primitives/tests/test_dataframe_to_ndarray.py +++ /dev/null @@ -1,40 +0,0 @@ -import unittest - -from common_primitives import dataframe_to_ndarray, dataset_to_dataframe -from d3m import container - -import utils as test_utils - - -class DataFrameToNDArrayPrimitiveTestCase(unittest.TestCase): - def test_basic(self): - # load the iris dataset - dataset = test_utils.load_iris_metadata() - - # convert the dataset into a dataframe - dataset_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataset_hyperparams_class.defaults()) - dataframe = dataframe_primitive.produce(inputs=dataset).value - - # convert the dataframe into a numpy array - numpy_hyperparams_class = dataframe_to_ndarray.DataFrameToNDArrayPrimitive.metadata.get_hyperparams() - numpy_primitive = dataframe_to_ndarray.DataFrameToNDArrayPrimitive(hyperparams=numpy_hyperparams_class.defaults()) - numpy_array = numpy_primitive.produce(inputs=dataframe).value - - self.assertIsInstance(numpy_array, container.ndarray) - - # verify dimensions - self.assertEqual(len(numpy_array), 150) - self.assertEqual(len(numpy_array[0]), 6) - - # verify data type is unchanged - for row in numpy_array: - for val in row: - self.assertIsInstance(val, str) - - # validate metadata - test_utils.test_iris_metadata(self, numpy_array.metadata, 'd3m.container.numpy.ndarray') - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_dataframe_utils.py b/common-primitives/tests/test_dataframe_utils.py deleted file mode 100644 index 9b2b7d7..0000000 --- a/common-primitives/tests/test_dataframe_utils.py +++ /dev/null @@ -1,27 +0,0 @@ -import unittest -import os - -from common_primitives import dataframe_utils -from d3m import container -from d3m.base import utils as base_utils - -import utils as test_utils - - -class DataFrameUtilsTestCase(unittest.TestCase): - def test_inclusive(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - resource = test_utils.get_dataframe(dataset) - - to_keep_indices = [1, 2, 5] - - output = dataframe_utils.select_rows(resource, to_keep_indices) - self.assertEqual(len(output), 3) - self.assertEqual(len(output.iloc[0]), 5) - self.assertEqual(output.iloc[1, 0], '3') - self.assertEqual(output.iloc[2, 0], '6') - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_dataset_map.py b/common-primitives/tests/test_dataset_map.py deleted file mode 100644 index a789d4d..0000000 --- a/common-primitives/tests/test_dataset_map.py +++ /dev/null @@ -1,73 +0,0 @@ -import unittest -import os -import pickle -import sys - -from d3m import container, index, utils as d3m_utils - -TEST_PRIMITIVES_DIR = os.path.join(os.path.dirname(__file__), 'data', 'primitives') -sys.path.insert(0, TEST_PRIMITIVES_DIR) - -from test_primitives.null import NullTransformerPrimitive, NullUnsupervisedLearnerPrimitive - -# To hide any logging or stdout output. 
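-# (register_primitive() makes the test primitives importable under the
-# d3m.primitives namespace, e.g. d3m.primitives.operator.null.TransformerTest,
-# so they behave like any installed primitive in the code below.)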
-with d3m_utils.silence(): - index.register_primitive('d3m.primitives.operator.null.TransformerTest', NullTransformerPrimitive) - index.register_primitive('d3m.primitives.operator.null.UnsupervisedLearnerTest', NullUnsupervisedLearnerPrimitive) - -from common_primitives import dataset_to_dataframe, denormalize, dataset_map, column_parser - -import utils as test_utils - - -class DatasetMapTestCase(unittest.TestCase): - def test_basic(self): - self.maxDiff = None - - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # First we try denormalizing and column parsing. - hyperparams = denormalize.DenormalizePrimitive.metadata.get_hyperparams() - primitive = denormalize.DenormalizePrimitive(hyperparams=hyperparams.defaults()) - dataset_1 = primitive.produce(inputs=dataset).value - - hyperparams = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams.defaults()) - dataframe_1 = primitive.produce(inputs=dataset_1).value - - hyperparams = column_parser.ColumnParserPrimitive.metadata.get_hyperparams() - primitive = column_parser.ColumnParserPrimitive(hyperparams=hyperparams.defaults().replace({'return_result': 'replace'})) - dataframe_1 = primitive.produce(inputs=dataframe_1).value - - # Second we try first column parsing and then denormalizing. - hyperparams = dataset_map.DataFrameDatasetMapPrimitive.metadata.get_hyperparams() - primitive = dataset_map.DataFrameDatasetMapPrimitive( - # We have to make an instance of the primitive ourselves. - hyperparams=hyperparams.defaults().replace({ - 'primitive': column_parser.ColumnParserPrimitive( - hyperparams=column_parser.ColumnParserPrimitive.metadata.get_hyperparams().defaults(), - ), - 'resources': 'all', - }), - - ) - dataset_2 = primitive.produce(inputs=dataset).value - - hyperparams = denormalize.DenormalizePrimitive.metadata.get_hyperparams() - primitive = denormalize.DenormalizePrimitive(hyperparams=hyperparams.defaults()) - dataset_2 = primitive.produce(inputs=dataset_2).value - - hyperparams = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams.defaults()) - dataframe_2 = primitive.produce(inputs=dataset_2).value - - self.assertEqual(test_utils.convert_through_json(dataframe_1), test_utils.convert_through_json(dataframe_2)) - self.assertEqual(dataframe_1.metadata.to_internal_json_structure(), dataframe_2.metadata.to_internal_json_structure()) - - pickle.dumps(primitive) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_dataset_sample.py b/common-primitives/tests/test_dataset_sample.py deleted file mode 100644 index 57da93a..0000000 --- a/common-primitives/tests/test_dataset_sample.py +++ /dev/null @@ -1,58 +0,0 @@ -import os -import pickle -import unittest -import pandas as pd - -from d3m import container -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_sample - - -class DatasetSamplePrimitiveTestCase(unittest.TestCase): - def test_produce(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'tests', 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - - dataset = 
container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - hyperparams_class = dataset_sample.DatasetSamplePrimitive.metadata.get_hyperparams() - - # We set semantic types like runtime would. - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - - sample_sizes = [0.1, 0.5, 0.9, 4, 22, 40] - dataset_sizes = [4, 22, 40, 4, 22, 40] - for s, d in zip(sample_sizes, dataset_sizes): - primitive = dataset_sample.DatasetSamplePrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'sample_size': s, - })) - result = primitive.produce(inputs=dataset).value - self.assertEqual(len(result['learningData'].iloc[:, 0]), d, s) - - def test_empty_test_set(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'tests', 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # Set target columns to '' to imitate a test dataset. - dataset['learningData']['species'] = '' - - # We set semantic types like runtime would. - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - - hyperparams_class = dataset_sample.DatasetSamplePrimitive.metadata.get_hyperparams() - - # Check that sampling is skipped and all 150 rows are kept. - sample_sizes = [0.1, 0.5, 0.9] - for s in sample_sizes: - primitive = dataset_sample.DatasetSamplePrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'sample_size': s, - })) - result = primitive.produce(inputs=dataset).value - self.assertEqual(len(result['learningData'].iloc[:, 0]), 150, s) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_dataset_to_dataframe.py b/common-primitives/tests/test_dataset_to_dataframe.py deleted file mode 100644 index a7718be..0000000 --- a/common-primitives/tests/test_dataset_to_dataframe.py +++ /dev/null @@ -1,93 +0,0 @@ -import os -import unittest - -from d3m import container, utils -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe - -import utils as test_utils - - -class DatasetToDataFramePrimitiveTestCase(unittest.TestCase): - def test_basic(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - - call_metadata = primitive.produce(inputs=dataset) - - dataframe = call_metadata.value - - self.assertIsInstance(dataframe, container.DataFrame) - - for row in dataframe.values: - for cell in row: - # Nothing should be parsed from a string.
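-                # (DatasetToDataFramePrimitive only changes the container type;
-                # values stay the raw strings loaded from the CSV until a parsing
-                # primitive such as ColumnParserPrimitive is run.)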
- self.assertIsInstance(cell, str) - - self.assertEqual(len(dataframe), 150) - self.assertEqual(len(dataframe.iloc[0]), 6) - - self._test_metadata(dataframe.metadata) - - def _test_metadata(self, metadata): - self.maxDiff = None - - self.assertEqual(test_utils.convert_through_json(metadata.query(())), { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/Table', - ], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - } - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS,))), { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 6, - } - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, 0))), { - 'name': 'd3mIndex', - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - }) - - for i in range(1, 5): - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, i))), { - 'name': ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth'][i - 1], - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }, i) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, 5))), { - 'name': 'species', - 'structural_type': 'str', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_datetime_field_compose.py b/common-primitives/tests/test_datetime_field_compose.py deleted file mode 100644 index ac93823..0000000 --- a/common-primitives/tests/test_datetime_field_compose.py +++ /dev/null @@ -1,67 +0,0 @@ -import math -import os.path -import unittest - -from datetime import datetime -from d3m import container, utils -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe, datetime_field_compose - -import utils as test_utils - - -class DatetimeFieldComposePrimitiveTestCase(unittest.TestCase): - def test_compose_two_fields(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'timeseries_dataset_3', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - resource = test_utils.get_dataframe(dataset) - - compose_hyperparams_class = datetime_field_compose.DatetimeFieldComposePrimitive.metadata.get_hyperparams() - hp = compose_hyperparams_class({ - 'columns': [2,3], - 'join_char': '-', - 'output_name': 'timestamp' - }) - compose_primitive = datetime_field_compose.DatetimeFieldComposePrimitive(hyperparams=hp) - new_dataframe = compose_primitive.produce(inputs=resource).value - - self.assertEqual(new_dataframe.shape, (40, 6)) - self.assertEqual(datetime(2013, 11, 1), new_dataframe['timestamp'][0]) - - col_meta = new_dataframe.metadata.query((metadata_base.ALL_ELEMENTS, 5)) - self.assertEqual(col_meta['name'], 'timestamp') - 
self.assertTrue('https://metadata.datadrivendiscovery.org/types/Time' in col_meta['semantic_types']) - - def test_bad_join_char(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'timeseries_dataset_3', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - resource = test_utils.get_dataframe(dataset) - - compose_hyperparams_class = datetime_field_compose.DatetimeFieldComposePrimitive.metadata.get_hyperparams() - hp = compose_hyperparams_class({ - 'columns': [2,3], - 'join_char': 'cc', - 'output_name': 'timestamp' - }) - compose_primitive = datetime_field_compose.DatetimeFieldComposePrimitive(hyperparams=hp) - with self.assertRaises(ValueError): - compose_primitive.produce(inputs=resource) - - def test_bad_columns(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'timeseries_dataset_3', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - resource = test_utils.get_dataframe(dataset) - - compose_hyperparams_class = datetime_field_compose.DatetimeFieldComposePrimitive.metadata.get_hyperparams() - hp = compose_hyperparams_class({ - 'columns': [1,2], - 'join_char': '-', - 'output_name': 'timestamp' - }) - compose_primitive = datetime_field_compose.DatetimeFieldComposePrimitive(hyperparams=hp) - with self.assertRaises(ValueError): - compose_primitive.produce(inputs=resource) - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_datetime_range_filter.py b/common-primitives/tests/test_datetime_range_filter.py deleted file mode 100644 index d047e92..0000000 --- a/common-primitives/tests/test_datetime_range_filter.py +++ /dev/null @@ -1,149 +0,0 @@ -import unittest -import os - -from datetime import datetime -from dateutil import parser -from common_primitives import datetime_range_filter -from d3m import container - -import utils as test_utils - - -class DatetimeRangeFilterPrimitiveTestCase(unittest.TestCase): - def test_inclusive_strict(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'timeseries_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - resource = test_utils.get_dataframe(dataset) - - filter_hyperparams_class = datetime_range_filter.DatetimeRangeFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class.defaults().replace({ - 'column': 3, - 'min': datetime(2013, 11, 8), - 'max': datetime(2013, 12, 3), - 'strict': True, - 'inclusive': True - }) - filter_primitive = datetime_range_filter.DatetimeRangeFilterPrimitive(hyperparams=hp) - new_dataframe = filter_primitive.produce(inputs=resource).value - - self.assertGreater(new_dataframe['Date'].apply(parser.parse).min(), datetime(2013, 11, 8)) - self.assertLess(new_dataframe['Date'].apply(parser.parse).max(), datetime(2013, 12, 3)) - self.assertEqual(15, len(new_dataframe)) - - def test_inclusive_permissive(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'timeseries_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - resource = test_utils.get_dataframe(dataset) - - filter_hyperparams_class = 
datetime_range_filter.DatetimeRangeFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class.defaults().replace({ - 'column': 3, - 'min': datetime(2013, 11, 8), - 'max': datetime(2013, 12, 3), - 'strict': False, - 'inclusive': True - }) - filter_primitive = datetime_range_filter.DatetimeRangeFilterPrimitive(hyperparams=hp) - new_dataframe = filter_primitive.produce(inputs=resource).value - - self.assertGreaterEqual(new_dataframe['Date'].apply(parser.parse).min(), datetime(2013, 11, 8)) - self.assertLessEqual(new_dataframe['Date'].apply(parser.parse).max(), datetime(2013, 12, 3)) - self.assertEqual(17, len(new_dataframe)) - - def test_exclusive_strict(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'timeseries_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - resource = test_utils.get_dataframe(dataset) - - filter_hyperparams_class = datetime_range_filter \ - .DatetimeRangeFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class.defaults().replace({ - 'column': 3, - 'min': datetime(2013, 11, 8), - 'max': datetime(2013, 12, 3), - 'strict': True, - 'inclusive': False - }) - filter_primitive = datetime_range_filter.DatetimeRangeFilterPrimitive(hyperparams=hp) - new_dataframe = filter_primitive.produce(inputs=resource).value - - self.assertEqual( - len(new_dataframe.loc[ - (new_dataframe['Date'].apply(parser.parse) >= datetime(2013, 11, 8)) & - (new_dataframe['Date'].apply(parser.parse) <= datetime(2013, 12, 3))]), 0) - self.assertEqual(23, len(new_dataframe)) - - def test_exclusive_permissive(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'timeseries_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - resource = test_utils.get_dataframe(dataset) - - filter_hyperparams_class = datetime_range_filter \ - .DatetimeRangeFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class.defaults().replace({ - 'column': 3, - 'min': datetime(2013, 11, 8), - 'max': datetime(2013, 12, 3), - 'strict': False, - 'inclusive': False - }) - filter_primitive = datetime_range_filter.DatetimeRangeFilterPrimitive(hyperparams=hp) - new_dataframe = filter_primitive.produce(inputs=resource).value - - self.assertEqual( - len(new_dataframe.loc[ - (new_dataframe['Date'].apply(parser.parse) > datetime(2013, 11, 8)) & - (new_dataframe['Date'].apply(parser.parse) < datetime(2013, 12, 3))]), 0) - self.assertEqual(25, len(new_dataframe)) - - def test_row_metadata_removal(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'timeseries_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # add metadata for rows 0 and 5 - dataset.metadata = dataset.metadata.update(('learningData', 0), {'a': 0}) - dataset.metadata = dataset.metadata.update(('learningData', 5), {'b': 1}) - - resource = test_utils.get_dataframe(dataset) - - # apply filter that removes rows 1 through 4 (row 5 then becomes row 1) - filter_hyperparams_class = datetime_range_filter.DatetimeRangeFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class.defaults().replace({ - 'column': 3, - 'min': datetime(2013, 11, 4), - 'max': datetime(2013, 11, 7), - 'strict': True, - 'inclusive': False - }) -
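-        # (Inferred from the tests above rather than from primitive docs:
-        # 'inclusive' keeps rows inside [min, max] when True and drops them when
-        # False, while 'strict' controls whether rows exactly on the min/max
-        # boundaries count as inside the range.)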
filter_primitive = datetime_range_filter.DatetimeRangeFilterPrimitive(hyperparams=hp) - new_df = filter_primitive.produce(inputs=resource).value - - # verify that the length is correct - self.assertEqual(len(new_df), new_df.metadata.query(())['dimension']['length']) - - # verify that the rows were re-indexed in the metadata - self.assertEqual(new_df.metadata.query((0,))['a'], 0) - self.assertEqual(new_df.metadata.query((1,))['b'], 1) - self.assertFalse('b' in new_df.metadata.query((5,))) - - def test_bad_type_handling(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'timeseries_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - resource = test_utils.get_dataframe(dataset) - - filter_hyperparams_class = datetime_range_filter \ - .DatetimeRangeFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class.defaults().replace({ - 'column': 1, - 'min': datetime(2013, 11, 1), - 'max': datetime(2013, 11, 4), - 'strict': False, - 'inclusive': False, - }) - filter_primitive = datetime_range_filter.DatetimeRangeFilterPrimitive(hyperparams=hp) - with self.assertRaises(ValueError): - filter_primitive.produce(inputs=resource) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_denormalize.py b/common-primitives/tests/test_denormalize.py deleted file mode 100644 index 0737fed..0000000 --- a/common-primitives/tests/test_denormalize.py +++ /dev/null @@ -1,469 +0,0 @@ -import os -import unittest - -from d3m import container, utils -from d3m.metadata import base as metadata_base - -from common_primitives import denormalize - -import utils as test_utils - - -class DenormalizePrimitiveTestCase(unittest.TestCase): - def test_discard(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - dataset_metadata_before = dataset.metadata.to_internal_json_structure() - - hyperparams_class = denormalize.DenormalizePrimitive.metadata.get_hyperparams() - - primitive = denormalize.DenormalizePrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'recursive': False, - 'discard_not_joined_tabular_resources': True, - })) - - denormalized_dataset = primitive.produce(inputs=dataset).value - - self.assertIsInstance(denormalized_dataset, container.Dataset) - - self.assertEqual(len(denormalized_dataset), 1) - - self.assertEqual(set(denormalized_dataset['learningData'].iloc[:, 1]), {'AAA', 'BBB', 'CCC'}) - self.assertEqual(set(denormalized_dataset['learningData'].iloc[:, 2]), {'AAA name', 'BBB name', 'CCC name'}) - self.assertEqual(set(denormalized_dataset['learningData'].iloc[:, 3]), {'1', '2', ''}) - self.assertEqual(set(denormalized_dataset['learningData'].iloc[:, 4]), {'aaa', 'bbb', 'ccc', 'ddd', 'eee'}) - self.assertEqual(set(denormalized_dataset['learningData'].iloc[:, 5]), {'1990', '2000', '2010'}) - - self._test_discard_metadata(denormalized_dataset.metadata, dataset_doc_path) - - self.assertEqual(dataset.metadata.to_internal_json_structure(), dataset_metadata_before) - - def _test_discard_metadata(self, metadata, dataset_doc_path): - self.maxDiff = None - - self.assertEqual(test_utils.convert_through_json(metadata.query(())), { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 
'd3m.container.dataset.Dataset', - 'id': 'database_dataset_1', - 'version': '4.0.0', - 'name': 'A dataset simulating a database dump', - 'location_uris': [ - 'file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path), - ], - 'dimension': { - 'name': 'resources', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/DatasetResource', - ], - 'length': 1, - }, - 'digest': '68c435c6ba9a1c419c79507275c0d5710786dfe481e48f35591d87a7dbf5bb1a', - 'description': 'A synthetic dataset trying to be similar to a database dump, with tables with different relations between them.', - 'source': { - 'license': 'CC', - 'redacted': False, - }, - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData',))), { - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/Table', - 'https://metadata.datadrivendiscovery.org/types/DatasetEntryPoint', - ], - 'dimension': { - 'name': 'rows', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/TabularRow', - ], - 'length': 45, - }, - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS,))), { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 7, - } - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS, 0))), { - 'name': 'd3mIndex', - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS, 3))), { - 'name': 'author', - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'foreign_key': { - 'type': 'COLUMN', - 'resource_id': 'authors', - 'column_index': 0, - }, - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS, 1))), { - 'name': 'code', - 'structural_type': 'str', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS, 2))), { - 'name': 'name', - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Text', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS, 4))), { - 'name': 'key', - 'structural_type': 'str', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS, 5))), { - 'name': 'year', - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/DateTime', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS, 6))), { - 'name': 'value', - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Float', - 
'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }) - - def test_recursive(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - dataset_metadata_before = dataset.metadata.to_internal_json_structure() - - hyperparams_class = denormalize.DenormalizePrimitive.metadata.get_hyperparams() - - primitive = denormalize.DenormalizePrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'recursive': True, - 'discard_not_joined_tabular_resources': False, - })) - - denormalized_dataset = primitive.produce(inputs=dataset).value - - self.assertIsInstance(denormalized_dataset, container.Dataset) - - self.assertEqual(len(denormalized_dataset), 4) - - self.assertEqual(denormalized_dataset['values'].shape[0], 64) - self.assertEqual(denormalized_dataset['learningData'].shape[1], 8) - - self.assertEqual(set(denormalized_dataset['learningData'].iloc[:, 1]), {'AAA', 'BBB', 'CCC'}) - self.assertEqual(set(denormalized_dataset['learningData'].iloc[:, 2]), {'AAA name', 'BBB name', 'CCC name'}) - self.assertEqual(set(denormalized_dataset['learningData'].iloc[:, 3]), {'1', '2', ''}) - self.assertEqual(set(denormalized_dataset['learningData'].iloc[:, 4]), {'1 name', '2 name', ''}) - self.assertEqual(set(denormalized_dataset['learningData'].iloc[:, 5]), {'aaa', 'bbb', 'ccc', 'ddd', 'eee'}) - self.assertEqual(set(denormalized_dataset['learningData'].iloc[:, 6]), {'1990', '2000', '2010'}) - - self._test_recursive_metadata(denormalized_dataset.metadata, dataset_doc_path) - - self.assertEqual(dataset.metadata.to_internal_json_structure(), dataset_metadata_before) - - def _test_recursive_metadata(self, metadata, dataset_doc_path): - self.maxDiff = None - - self.assertEqual(test_utils.convert_through_json(metadata.query(())), { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.dataset.Dataset', - 'id': 'database_dataset_1', - 'version': '4.0.0', - 'name': 'A dataset simulating a database dump', - 'location_uris': [ - 'file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path), - ], - 'dimension': { - 'name': 'resources', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/DatasetResource', - ], - 'length': 4, - }, - 'digest': '68c435c6ba9a1c419c79507275c0d5710786dfe481e48f35591d87a7dbf5bb1a', - 'description': 'A synthetic dataset trying to be similar to a database dump, with tables with different relations between them.', - 'source': { - 'license': 'CC', - 'redacted': False, - }, - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData',))), { - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/Table', - 'https://metadata.datadrivendiscovery.org/types/DatasetEntryPoint', - ], - 'dimension': { - 'name': 'rows', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/TabularRow', - ], - 'length': 45, - }, - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS,))), { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 8, - } - }) - - 
self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS, 0))), { - 'name': 'd3mIndex', - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS, 3))), { - 'name': 'id', - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS, 1))), { - 'name': 'code', - 'structural_type': 'str', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }) - - for i in [2, 4]: - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS, i))), { - 'name': ['name', None, 'name'][i - 2], - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Text', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }, i) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS, 5))), { - 'name': 'key', - 'structural_type': 'str', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS, 6))), { - 'name': 'year', - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/DateTime', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS, 7))), { - 'name': 'value', - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }) - - def test_row_order(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'image_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - dataset_metadata_before = dataset.metadata.to_internal_json_structure() - - hyperparams_class = denormalize.DenormalizePrimitive.metadata.get_hyperparams() - - primitive = denormalize.DenormalizePrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'recursive': True, - 'discard_not_joined_tabular_resources': False, - })) - - denormalized_dataset = primitive.produce(inputs=dataset).value - - self.assertIsInstance(denormalized_dataset, container.Dataset) - - self.assertEqual(len(denormalized_dataset), 1) - - self.assertEqual(denormalized_dataset['learningData'].shape, (5, 3)) - - self.assertEqual(denormalized_dataset['learningData'].values.tolist(), [ - ['0', 'mnist_0_2.png', 'mnist'], - ['1', 'mnist_1_1.png', 'mnist'], - ['2', '001_HandPhoto_left_01.jpg', 'handgeometry'], - ['3', 'cifar10_bird_1.png', 'cifar'], - ['4', 'cifar10_bird_2.png', 'cifar'], - ]) - - self._test_row_order_metadata(denormalized_dataset.metadata, dataset_doc_path) - - self.assertEqual(dataset.metadata.to_internal_json_structure(), 
dataset_metadata_before) - - def _test_row_order_metadata(self, metadata, dataset_doc_path): - self.maxDiff = None - - self.assertEqual(test_utils.convert_through_json(metadata.query(())), { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.dataset.Dataset', - 'id': 'image_dataset_1', - 'version': '4.0.0', - 'name': 'Image dataset to be used for tests', - 'location_uris': [ - 'file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path), - ], - 'dimension': { - 'name': 'resources', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/DatasetResource', - ], - 'length': 1, - }, - 'digest': '9b5553ce5ad84dfcefd379814dc6b11ef60a049479e3e91aa1251f7a5ef7409e', - 'description': 'There are a total of 5 image files, one is a left hand from the handgeometry dataset, two birds from cifar10 dataset and 2 figures from mnist dataset.', - 'source': { - 'license': 'Creative Commons Attribution-NonCommercial 4.0', - 'redacted': False, - }, - 'approximate_stored_size': 24000, - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData',))), { - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/Table', - 'https://metadata.datadrivendiscovery.org/types/DatasetEntryPoint', - ], - 'dimension': { - 'name': 'rows', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/TabularRow', - ], - 'length': 5, - }, - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS,))), { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 3, - } - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS, 0))), { - 'name': 'd3mIndex', - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS, 1))), { - 'name': 'filename', - 'structural_type': 'str', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/FileName', - 'http://schema.org/ImageObject', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - 'https://metadata.datadrivendiscovery.org/types/UniqueKey', - ], - 'location_base_uris': [ - 'file://{dataset_base_path}/media/'.format(dataset_base_path=os.path.dirname(dataset_doc_path)), - ], - 'media_types': [ - 'image/jpeg', - 'image/png', - ], - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', metadata_base.ALL_ELEMENTS, 2))), { - 'name': 'class', - 'structural_type': 'str', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', 0, 1))), { - 'name': 'filename', - 'structural_type': 'str', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/FileName', - 'http://schema.org/ImageObject', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - 'https://metadata.datadrivendiscovery.org/types/UniqueKey', - ], - 'location_base_uris': [ - 
'file://{dataset_base_path}/media/'.format(dataset_base_path=os.path.dirname(dataset_doc_path)), - ], - 'media_types': [ - 'image/png', - ], - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query(('learningData', 2, 1))), { - 'name': 'filename', - 'structural_type': 'str', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/FileName', - 'http://schema.org/ImageObject', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - 'https://metadata.datadrivendiscovery.org/types/UniqueKey', - ], - 'location_base_uris': [ - 'file://{dataset_base_path}/media/'.format(dataset_base_path=os.path.dirname(dataset_doc_path)), - ], - 'media_types': [ - 'image/jpeg', - ], - }) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_extract_columns_semantic_types.py b/common-primitives/tests/test_extract_columns_semantic_types.py deleted file mode 100644 index aff2b59..0000000 --- a/common-primitives/tests/test_extract_columns_semantic_types.py +++ /dev/null @@ -1,203 +0,0 @@ -import os -import unittest - -from d3m import container -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe, extract_columns_semantic_types - -import utils as test_utils - - -class ExtractColumnsBySemanticTypePrimitiveTestCase(unittest.TestCase): - def test_basic(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - - call_metadata = primitive.produce(inputs=dataset) - - dataframe = call_metadata.value - - hyperparams_class = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive.metadata.get_hyperparams() - - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive(hyperparams=hyperparams_class.defaults().replace({'semantic_types': ('https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/PrimaryKey')})) - - call_metadata = primitive.produce(inputs=dataframe) - - dataframe = call_metadata.value - - self._test_metadata(dataframe.metadata) - - def _test_metadata(self, metadata): - self.maxDiff = None - - self.assertEqual(test_utils.convert_through_json(metadata.query(())), { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/Table', - ], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - } - }) - - 
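The test_basic flow above is the canonical two-step pattern these column-extraction tests exercise: dataset to entry-point DataFrame, then subset by semantic type. A condensed sketch of just that pattern, assuming `dataset` is already loaded and annotated as in the test:

from common_primitives import dataset_to_dataframe, extract_columns_semantic_types

# Step 1: turn the dataset's entry-point resource into a DataFrame.
hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams()
dataframe = dataset_to_dataframe.DatasetToDataFramePrimitive(
    hyperparams=hyperparams_class.defaults(),
).produce(inputs=dataset).value

# Step 2: keep only columns tagged with any of the requested semantic types.
hyperparams_class = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive.metadata.get_hyperparams()
primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive(
    hyperparams=hyperparams_class.defaults().replace({
        'semantic_types': (
            'https://metadata.datadrivendiscovery.org/types/Attribute',
            'https://metadata.datadrivendiscovery.org/types/PrimaryKey',
        ),
    }),
)
extracted = primitive.produce(inputs=dataframe).value  # d3mIndex plus the four attribute columns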
self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS,))), { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 5, - } - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, 0))), { - 'name': 'd3mIndex', - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - }) - - for i in range(1, 5): - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, i))), { - 'name': ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth'][i - 1], - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }, i) - - self.assertTrue(metadata.get_elements((metadata_base.ALL_ELEMENTS,)) in [[0, 1, 2, 3, 4], [metadata_base.ALL_ELEMENTS, 0, 1, 2, 3, 4]]) - - def test_set(self): - dataset_doc_path = os.path.abspath( - os.path.join( - os.path.dirname(__file__), - "data", - "datasets", - "boston_dataset_1", - "datasetDoc.json", - ) - ) - - dataset = container.Dataset.load( - "file://{dataset_doc_path}".format(dataset_doc_path=dataset_doc_path) - ) - - # We set semantic types like runtime would. - dataset.metadata = dataset.metadata.add_semantic_type( - ("learningData", metadata_base.ALL_ELEMENTS, 14), - "https://metadata.datadrivendiscovery.org/types/Target", - ) - dataset.metadata = dataset.metadata.add_semantic_type( - ("learningData", metadata_base.ALL_ELEMENTS, 14), - "https://metadata.datadrivendiscovery.org/types/TrueTarget", - ) - dataset.metadata = dataset.metadata.remove_semantic_type( - ("learningData", metadata_base.ALL_ELEMENTS, 14), - "https://metadata.datadrivendiscovery.org/types/Attribute", - ) - - hyperparams_class = ( - dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - ) - - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive( - hyperparams=hyperparams_class.defaults() - ) - - call_metadata = primitive.produce(inputs=dataset) - - dataframe = call_metadata.value - - hyperparams_class = ( - extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive.metadata.get_hyperparams() - ) - - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive( - hyperparams=hyperparams_class.defaults().replace( - { - "semantic_types": ( - "https://metadata.datadrivendiscovery.org/types/Attribute", - "http://schema.org/Integer", - ), - "match_logic": "equal", - } - ) - ) - - call_metadata = primitive.produce(inputs=dataframe) - - dataframe = call_metadata.value - - self._test_equal_metadata(dataframe.metadata) - - def _test_equal_metadata(self, metadata): - self.maxDiff = None - - self.assertEqual( - test_utils.convert_through_json(metadata.query(())), - { - "structural_type": "d3m.container.pandas.DataFrame", - "semantic_types": [ - "https://metadata.datadrivendiscovery.org/types/Table" - ], - "dimension": { - "name": "rows", - "semantic_types": [ - "https://metadata.datadrivendiscovery.org/types/TabularRow" - ], - "length": 506, - }, - "schema": "https://metadata.datadrivendiscovery.org/schemas/v0/container.json", - }, - ) - - # only one column that should match - self.assertEqual( - test_utils.convert_through_json( - metadata.query((metadata_base.ALL_ELEMENTS,)) - ), - { - "dimension": { - "name": "columns", - "semantic_types": [ - 
"https://metadata.datadrivendiscovery.org/types/TabularColumn" - ], - "length": 1, - } - }, - ) - - self.assertEqual( - test_utils.convert_through_json( - metadata.query((metadata_base.ALL_ELEMENTS, 0)) - ), - { - "name": "TAX", - "structural_type": "str", - "semantic_types": [ - "http://schema.org/Integer", - "https://metadata.datadrivendiscovery.org/types/Attribute", - ], - "description": "full-value property-tax rate per $10,000", - }, - ) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_extract_columns_structural_types.py b/common-primitives/tests/test_extract_columns_structural_types.py deleted file mode 100644 index 2271181..0000000 --- a/common-primitives/tests/test_extract_columns_structural_types.py +++ /dev/null @@ -1,89 +0,0 @@ -import os -import unittest - -from d3m import container -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe, extract_columns_structural_types, column_parser - -import utils as test_utils - - -class ExtractColumnsByStructuralTypesPrimitiveTestCase(unittest.TestCase): - def test_basic(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - - call_metadata = primitive.produce(inputs=dataset) - - dataframe = call_metadata.value - - hyperparams_class = column_parser.ColumnParserPrimitive.metadata.get_hyperparams() - - primitive = column_parser.ColumnParserPrimitive(hyperparams=hyperparams_class.defaults()) - - call_metadata = primitive.produce(inputs=dataframe) - - dataframe = call_metadata.value - - hyperparams_class = extract_columns_structural_types.ExtractColumnsByStructuralTypesPrimitive.metadata.get_hyperparams() - - primitive = extract_columns_structural_types.ExtractColumnsByStructuralTypesPrimitive(hyperparams=hyperparams_class.defaults().replace({'structural_types': ('int',)})) - - call_metadata = primitive.produce(inputs=dataframe) - - dataframe = call_metadata.value - - self._test_metadata(dataframe.metadata) - - def _test_metadata(self, metadata): - self.maxDiff = None - - self.assertEqual(test_utils.convert_through_json(metadata.query(())), { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/Table', - ], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - } - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS,))), { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 2, - } - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, 0))), { - 'name': 'd3mIndex', - 'structural_type': 'int', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, 1))), { - 'name': 'species', - 'structural_type': 'int', - 
'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_fixed_split.py b/common-primitives/tests/test_fixed_split.py deleted file mode 100644 index 7059ada..0000000 --- a/common-primitives/tests/test_fixed_split.py +++ /dev/null @@ -1,148 +0,0 @@ -import os -import pickle -import unittest - -from d3m import container -from d3m.metadata import base as metadata_base - -from common_primitives import fixed_split - - -class FixedSplitDatasetSplitPrimitiveTestCase(unittest.TestCase): - def test_produce_train_values(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'tests', 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - hyperparams_class = fixed_split.FixedSplitDatasetSplitPrimitive.metadata.get_hyperparams() - - hyperparams = hyperparams_class.defaults().replace({ - 'primary_index_values': ['9', '11', '13'], - }) - - # We want to make sure "primary_index_values" is encoded just as a list and not - # a pickle because runtime populates this primitive as a list from a split file. - self.assertEqual(hyperparams.values_to_json_structure(), {'primary_index_values': ['9', '11', '13'], 'row_indices': [], 'delete_recursive': False}) - - primitive = fixed_split.FixedSplitDatasetSplitPrimitive(hyperparams=hyperparams) - - primitive.set_training_data(dataset=dataset) - primitive.fit() - - # To test that pickling works. - pickle.dumps(primitive) - - results = primitive.produce(inputs=container.List([0], generate_metadata=True)).value - - self.assertEqual(len(results), 1) - - for dataset in results: - self.assertEqual(len(dataset), 1) - - self.assertEqual(results[0]['learningData'].shape[0], 147) - self.assertEqual(list(results[0]['learningData'].iloc[:, 0]), [str(i) for i in range(150) if i not in [9, 11, 13]]) - - def test_produce_score_values(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'tests', 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. 
- dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - hyperparams_class = fixed_split.FixedSplitDatasetSplitPrimitive.metadata.get_hyperparams() - - hyperparams = hyperparams_class.defaults().replace({ - 'primary_index_values': ['9', '11', '13'], - }) - - # We want to make sure "primary_index_values" is encoded just as a list and not - # a pickle because runtime populates this primitive as a list from a split file. - self.assertEqual(hyperparams.values_to_json_structure(), {'primary_index_values': ['9', '11', '13'], 'row_indices': [], 'delete_recursive': False}) - - primitive = fixed_split.FixedSplitDatasetSplitPrimitive(hyperparams=hyperparams) - - primitive.set_training_data(dataset=dataset) - primitive.fit() - - results = primitive.produce_score_data(inputs=container.List([0], generate_metadata=True)).value - - self.assertEqual(len(results), 1) - - for dataset in results: - self.assertEqual(len(dataset), 1) - - self.assertEqual(results[0]['learningData'].shape[0], 3) - self.assertEqual(list(results[0]['learningData'].iloc[:, 0]), [str(i) for i in range(150) if i in [9, 11, 13]]) - - def test_produce_train_indices(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'tests', 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - hyperparams_class = fixed_split.FixedSplitDatasetSplitPrimitive.metadata.get_hyperparams() - - primitive = fixed_split.FixedSplitDatasetSplitPrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'row_indices': [9, 11, 13], - })) - - primitive.set_training_data(dataset=dataset) - primitive.fit() - - # To test that pickling works. - pickle.dumps(primitive) - - results = primitive.produce(inputs=container.List([0], generate_metadata=True)).value - - self.assertEqual(len(results), 1) - - for dataset in results: - self.assertEqual(len(dataset), 1) - - self.assertEqual(results[0]['learningData'].shape[0], 147) - self.assertEqual(list(results[0]['learningData'].iloc[:, 0]), [str(i) for i in range(150) if i not in [9, 11, 13]]) - - def test_produce_score_indices(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'tests', 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. 
- dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - hyperparams_class = fixed_split.FixedSplitDatasetSplitPrimitive.metadata.get_hyperparams() - - primitive = fixed_split.FixedSplitDatasetSplitPrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'row_indices': [9, 11, 13], - })) - - primitive.set_training_data(dataset=dataset) - primitive.fit() - - results = primitive.produce_score_data(inputs=container.List([0], generate_metadata=True)).value - - self.assertEqual(len(results), 1) - - for dataset in results: - self.assertEqual(len(dataset), 1) - - self.assertEqual(results[0]['learningData'].shape[0], 3) - self.assertEqual(list(results[0]['learningData'].iloc[:, 0]), [str(i) for i in range(150) if i in [9, 11, 13]]) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_grouping_field_compose.py b/common-primitives/tests/test_grouping_field_compose.py deleted file mode 100644 index 5380be8..0000000 --- a/common-primitives/tests/test_grouping_field_compose.py +++ /dev/null @@ -1,56 +0,0 @@ -import math -import os.path -import unittest - -from d3m import container, utils -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe, grouping_field_compose - -import utils as test_utils - - -class GroupingFieldComposePrimitiveTestCase(unittest.TestCase): - def test_compose_two_suggested_fields(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'timeseries_dataset_3', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - resource = test_utils.get_dataframe(dataset) - - compose_hyperparams_class = grouping_field_compose.GroupingFieldComposePrimitive.metadata.get_hyperparams() - hp = compose_hyperparams_class.defaults().replace({ - 'join_char': '-', - 'output_name': 'grouping' - }) - compose_primitive = grouping_field_compose.GroupingFieldComposePrimitive(hyperparams=hp) - new_dataframe = compose_primitive.produce(inputs=resource).value - - self.assertEqual(new_dataframe.shape, (40, 6)) - self.assertEqual('abbv-2013', new_dataframe['grouping'][0]) - - col_meta = new_dataframe.metadata.query((metadata_base.ALL_ELEMENTS, 5)) - self.assertEqual(col_meta['name'], 'grouping') - self.assertTrue('https://metadata.datadrivendiscovery.org/types/GroupingKey' in col_meta['semantic_types']) - - def test_compose_two_specified_fields(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'timeseries_dataset_3', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - resource = test_utils.get_dataframe(dataset) - - compose_hyperparams_class = grouping_field_compose.GroupingFieldComposePrimitive.metadata.get_hyperparams() - hp = compose_hyperparams_class.defaults().replace({ - 'columns': [1,3], - 'join_char': '-', - 'output_name': 'grouping' - }) - compose_primitive = grouping_field_compose.GroupingFieldComposePrimitive(hyperparams=hp) - 
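Conceptually, the compose step the surrounding test drives just string-joins the selected columns with join_char into a new column named by output_name (and, per the metadata assertions below, tags it as a GroupingKey). A plain-pandas illustration of the value logic only, on made-up data, not the primitive's implementation:

import pandas as pd

# Toy stand-in for the timeseries resource; values are invented.
frame = pd.DataFrame({'symbol': ['abbv', 'abbv'], 'day': ['11', '01']})

# Join the selected columns' string values with the join character.
frame['grouping'] = frame['symbol'] + '-' + frame['day']
print(frame['grouping'].tolist())  # ['abbv-11', 'abbv-01']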
new_dataframe = compose_primitive.produce(inputs=resource).value - - self.assertEqual(new_dataframe.shape, (40, 6)) - self.assertEqual('abbv-11-01', new_dataframe['grouping'][0]) - - col_meta = new_dataframe.metadata.query((metadata_base.ALL_ELEMENTS, 5)) - self.assertEqual(col_meta['name'], 'grouping') - self.assertTrue('https://metadata.datadrivendiscovery.org/types/GroupingKey' in col_meta['semantic_types']) - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_horizontal_concat.py b/common-primitives/tests/test_horizontal_concat.py deleted file mode 100644 index 0f8e78f..0000000 --- a/common-primitives/tests/test_horizontal_concat.py +++ /dev/null @@ -1,183 +0,0 @@ -import unittest -import os - -import numpy - -from d3m import container -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe, extract_columns_semantic_types, horizontal_concat - - -class HorizontalConcatPrimitiveTestCase(unittest.TestCase): - def test_basic(self): - test_data_inputs = {'col1': [1.0, 2.0, 3.0]} - dataframe_inputs = container.DataFrame(data=test_data_inputs, generate_metadata=True) - - test_data_targets = {'col2': [1, 2 ,3]} - dataframe_targets = container.DataFrame(data=test_data_targets, generate_metadata=True) - - hyperparams_class = horizontal_concat.HorizontalConcatPrimitive.metadata.get_hyperparams() - - primitive = horizontal_concat.HorizontalConcatPrimitive(hyperparams=hyperparams_class.defaults()) - - call_result = primitive.produce(left=dataframe_inputs, right=dataframe_targets) - - dataframe_concat = call_result.value - - self.assertEqual(dataframe_concat.values.tolist(), [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]) - - self._test_basic_metadata(dataframe_concat.metadata) - - def _test_basic_metadata(self, metadata): - self.assertEqual(metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'], 2) - self.assertEqual(metadata.query((metadata_base.ALL_ELEMENTS, 0))['name'], 'col1') - self.assertEqual(metadata.query((metadata_base.ALL_ELEMENTS, 0))['structural_type'], numpy.float64) - self.assertEqual(metadata.query((metadata_base.ALL_ELEMENTS, 1))['name'], 'col2') - self.assertEqual(metadata.query((metadata_base.ALL_ELEMENTS, 1))['structural_type'], numpy.int64) - - def _get_iris(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. 
- dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - - call_metadata = primitive.produce(inputs=dataset) - - dataframe = call_metadata.value - - return dataframe - - def _get_iris_columns(self): - dataframe = self._get_iris() - - hyperparams_class = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive.metadata.get_hyperparams() - - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive(hyperparams=hyperparams_class.defaults().replace({'semantic_types': ('https://metadata.datadrivendiscovery.org/types/PrimaryKey',)})) - - call_metadata = primitive.produce(inputs=dataframe) - - index = call_metadata.value - - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive(hyperparams=hyperparams_class.defaults().replace({'semantic_types': ('https://metadata.datadrivendiscovery.org/types/Attribute',)})) - - call_metadata = primitive.produce(inputs=dataframe) - - attributes = call_metadata.value - - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive(hyperparams=hyperparams_class.defaults().replace({'semantic_types': ('https://metadata.datadrivendiscovery.org/types/SuggestedTarget',)})) - - call_metadata = primitive.produce(inputs=dataframe) - - targets = call_metadata.value - - return dataframe, index, attributes, targets - - def test_iris(self): - dataframe, index, attributes, targets = self._get_iris_columns() - - hyperparams_class = horizontal_concat.HorizontalConcatPrimitive.metadata.get_hyperparams() - - primitive = horizontal_concat.HorizontalConcatPrimitive(hyperparams=hyperparams_class.defaults()) - - call_metadata = primitive.produce(left=index, right=attributes) - - call_metadata = primitive.produce(left=call_metadata.value, right=targets) - - new_dataframe = call_metadata.value - - self.assertEqual(dataframe.values.tolist(), new_dataframe.values.tolist()) - - self._test_iris_metadata(dataframe.metadata, new_dataframe.metadata) - - def _test_iris_metadata(self, metadata, new_metadata): - self.assertEqual(metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'], new_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length']) - - for i in range(new_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length']): - self.assertEqual(metadata.query((metadata_base.ALL_ELEMENTS, i)), new_metadata.query((metadata_base.ALL_ELEMENTS, i)), i) - - def _get_iris_columns_with_index(self): - dataframe = self._get_iris() - - hyperparams_class = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive.metadata.get_hyperparams() - - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive(hyperparams=hyperparams_class.defaults().replace({'semantic_types': ('https://metadata.datadrivendiscovery.org/types/PrimaryKey',)})) - - call_metadata = primitive.produce(inputs=dataframe) - - index = call_metadata.value 
- - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive(hyperparams=hyperparams_class.defaults().replace({'semantic_types': ('https://metadata.datadrivendiscovery.org/types/PrimaryKey', 'https://metadata.datadrivendiscovery.org/types/Attribute')})) - - call_metadata = primitive.produce(inputs=dataframe) - - attributes = call_metadata.value - - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive(hyperparams=hyperparams_class.defaults().replace({'semantic_types': ('https://metadata.datadrivendiscovery.org/types/PrimaryKey', 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget')})) - - call_metadata = primitive.produce(inputs=dataframe) - - targets = call_metadata.value - - return dataframe, index, attributes, targets - - def test_iris_with_index_removed(self): - dataframe, index, attributes, targets = self._get_iris_columns_with_index() - - hyperparams_class = horizontal_concat.HorizontalConcatPrimitive.metadata.get_hyperparams() - - primitive = horizontal_concat.HorizontalConcatPrimitive(hyperparams=hyperparams_class.defaults().replace({'use_index': False})) - - call_metadata = primitive.produce(left=index, right=attributes) - - call_metadata = primitive.produce(left=call_metadata.value, right=targets) - - new_dataframe = call_metadata.value - - self.assertEqual(dataframe.values.tolist(), new_dataframe.values.tolist()) - - self._test_iris_with_index_removed_metadata(dataframe.metadata, new_dataframe.metadata) - - def _test_iris_with_index_removed_metadata(self, metadata, new_metadata): - self.assertEqual(metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'], new_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length']) - - for i in range(new_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length']): - self.assertEqual(metadata.query((metadata_base.ALL_ELEMENTS, i)), new_metadata.query((metadata_base.ALL_ELEMENTS, i)), i) - - def test_iris_with_index_reorder(self): - dataframe, index, attributes, targets = self._get_iris_columns_with_index() - - # Let's make problems. 
- attributes = attributes.sort_values(by='sepalLength').reset_index(drop=True) - - hyperparams_class = horizontal_concat.HorizontalConcatPrimitive.metadata.get_hyperparams() - - primitive = horizontal_concat.HorizontalConcatPrimitive(hyperparams=hyperparams_class.defaults()) - - call_metadata = primitive.produce(left=index, right=attributes) - - call_metadata = primitive.produce(left=call_metadata.value, right=targets) - - new_dataframe = call_metadata.value - - self.assertEqual(dataframe.values.tolist(), new_dataframe.values.tolist()) - - self._test_iris_with_index_reorder_metadata(dataframe.metadata, new_dataframe.metadata) - - def _test_iris_with_index_reorder_metadata(self, metadata, new_metadata): - self.assertEqual(metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'], new_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length']) - - for i in range(new_metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length']): - self.assertEqual(metadata.query((metadata_base.ALL_ELEMENTS, i)), new_metadata.query((metadata_base.ALL_ELEMENTS, i)), i) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_kfold_split.py b/common-primitives/tests/test_kfold_split.py deleted file mode 100644 index 9983a6e..0000000 --- a/common-primitives/tests/test_kfold_split.py +++ /dev/null @@ -1,100 +0,0 @@ -import os -import pickle -import unittest - -from d3m import container -from d3m.metadata import base as metadata_base - -from common_primitives import kfold_split - - -class KFoldDatasetSplitPrimitiveTestCase(unittest.TestCase): - def test_produce_train(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'tests', 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - hyperparams_class = kfold_split.KFoldDatasetSplitPrimitive.metadata.get_hyperparams() - - primitive = kfold_split.KFoldDatasetSplitPrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'number_of_folds': 10, - 'shuffle': True, - 'delete_recursive': True, - })) - - primitive.set_training_data(dataset=dataset) - primitive.fit() - - # To test that pickling works. 
- pickle.dumps(primitive) - - results = primitive.produce(inputs=container.List([0, 1], generate_metadata=True)).value - - self.assertEqual(len(results), 2) - - for dataset in results: - self.assertEqual(len(dataset), 4) - - self.assertEqual(results[0]['codes'].shape[0], 3) - self.assertEqual(results[1]['codes'].shape[0], 3) - - self.assertEqual(set(results[0]['codes'].iloc[:, 0]), {'AAA', 'BBB', 'CCC'}) - self.assertEqual(len(results[0]['learningData'].iloc[:, 0]), 40) - self.assertEqual(set(results[0]['learningData'].iloc[:, 1]), {'AAA', 'BBB', 'CCC'}) - self.assertEqual(set(results[0]['learningData'].iloc[:, 2]), {'aaa', 'bbb', 'ccc', 'ddd', 'eee'}) - self.assertEqual(set(results[0]['learningData'].iloc[:, 3]), {'1990', '2000', '2010'}) - - self.assertEqual(set(results[1]['codes'].iloc[:, 0]), {'AAA', 'BBB', 'CCC'}) - self.assertEqual(len(results[1]['learningData'].iloc[:, 0]), 40) - self.assertEqual(set(results[1]['learningData'].iloc[:, 1]), {'AAA', 'BBB', 'CCC'}) - self.assertEqual(set(results[1]['learningData'].iloc[:, 2]), {'aaa', 'bbb', 'ccc', 'ddd', 'eee'}) - self.assertEqual(set(results[1]['learningData'].iloc[:, 3]), {'1990', '2000', '2010'}) - - def test_produce_score(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'tests', 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - hyperparams_class = kfold_split.KFoldDatasetSplitPrimitive.metadata.get_hyperparams() - - primitive = kfold_split.KFoldDatasetSplitPrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'number_of_folds': 10, - 'shuffle': True, - 'delete_recursive': True, - })) - - primitive.set_training_data(dataset=dataset) - primitive.fit() - - results = primitive.produce_score_data(inputs=container.List([0, 1], generate_metadata=True)).value - - self.assertEqual(len(results), 2) - - for dataset in results: - self.assertEqual(len(dataset), 4) - - self.assertEqual(set(results[0]['codes'].iloc[:, 0]), {'AAA', 'BBB'}) - self.assertEqual(set(results[0]['learningData'].iloc[:, 0]), {'5', '11', '28', '31', '38'}) - self.assertEqual(set(results[0]['learningData'].iloc[:, 1]), {'AAA', 'BBB'}) - self.assertEqual(set(results[0]['learningData'].iloc[:, 2]), {'aaa', 'bbb', 'ddd', 'eee'}) - self.assertEqual(set(results[0]['learningData'].iloc[:, 3]), {'1990', '2000'}) - - self.assertEqual(set(results[1]['codes'].iloc[:, 0]), {'BBB', 'CCC'}) - self.assertEqual(set(results[1]['learningData'].iloc[:, 0]), {'12', '26', '29', '32', '39'}) - self.assertEqual(set(results[1]['learningData'].iloc[:, 1]), {'BBB', 'CCC'}) - self.assertEqual(set(results[1]['learningData'].iloc[:, 2]), {'bbb', 'ccc', 'ddd', 'eee'}) - self.assertEqual(set(results[1]['learningData'].iloc[:, 3]), {'1990', '2000', '2010'}) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_kfold_timeseries_split.py 
b/common-primitives/tests/test_kfold_timeseries_split.py deleted file mode 100644 index 885ab2e..0000000 --- a/common-primitives/tests/test_kfold_timeseries_split.py +++ /dev/null @@ -1,223 +0,0 @@ -import os -import pickle -import unittest - -from d3m import container -from d3m.metadata import base as metadata_base - -from common_primitives import kfold_split_timeseries - - -class KFoldTimeSeriesSplitPrimitiveTestCase(unittest.TestCase): - def test_produce_train_timeseries_1(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'tests', 'data', 'datasets', 'timeseries_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - hyperparams_class = kfold_split_timeseries.KFoldTimeSeriesSplitPrimitive.metadata.get_hyperparams() - - folds = 5 - primitive = kfold_split_timeseries.KFoldTimeSeriesSplitPrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'number_of_folds': folds, - 'number_of_window_folds': 1, - })) - - primitive.set_training_data(dataset=dataset) - primitive.fit() - - # To test that pickling works. - pickle.dumps(primitive) - - results = primitive.produce(inputs=container.List([0, 1], generate_metadata=True)).value - - self.assertEqual(len(results), 2) - - for dataset in results: - self.assertEqual(len(dataset), 1) - - self.assertEqual(len(results[0]['learningData'].iloc[:, 0]), 8) - self.assertEqual(set(results[0]['learningData'].iloc[:, 3]), {'2013-11-05', '2013-11-06', '2013-11-07', '2013-11-08', '2013-11-11', - '2013-11-12', '2013-11-13', '2013-11-14'}) - - self.assertEqual(len(results[1]['learningData'].iloc[:, 0]), 8) - self.assertEqual(set(results[1]['learningData'].iloc[:, 3]), {'2013-11-13', '2013-11-14', '2013-11-15', '2013-11-18', '2013-11-19', - '2013-11-20', '2013-11-21', '2013-11-22'}) - - def test_produce_score_timeseries_1(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'tests', 'data', 'datasets', 'timeseries_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. 
- dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - hyperparams_class = kfold_split_timeseries.KFoldTimeSeriesSplitPrimitive.metadata.get_hyperparams() - - folds = 5 - primitive = kfold_split_timeseries.KFoldTimeSeriesSplitPrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'number_of_folds': folds, - 'number_of_window_folds': 1, - })) - - primitive.set_training_data(dataset=dataset) - primitive.fit() - - results = primitive.produce_score_data(inputs=container.List([0, 1], generate_metadata=True)).value - - self.assertEqual(len(results), 2) - - for dataset in results: - self.assertEqual(len(dataset), 1) - - self.assertEqual(len(results[0]['learningData'].iloc[:, 0]), 6) - self.assertEqual(set(results[0]['learningData'].iloc[:, 3]), {'2013-11-15', '2013-11-18', '2013-11-19', - '2013-11-20', '2013-11-21', '2013-11-22'}) - - self.assertEqual(len(results[1]['learningData'].iloc[:, 0]), 6) - self.assertEqual(set(results[1]['learningData'].iloc[:, 3]), {'2013-11-25', '2013-11-26', '2013-11-27', - '2013-11-29', '2013-12-02', '2013-12-03'}) - - def test_produce_train(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'tests', 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - # We fake that the dataset is time-series. - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 3), 'https://metadata.datadrivendiscovery.org/types/Time') - - hyperparams_class = kfold_split_timeseries.KFoldTimeSeriesSplitPrimitive.metadata.get_hyperparams() - - folds = 5 - primitive = kfold_split_timeseries.KFoldTimeSeriesSplitPrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'number_of_folds': folds, - 'number_of_window_folds': 1, - })) - - primitive.set_training_data(dataset=dataset) - primitive.fit() - - # To test that pickling works. 
- pickle.dumps(primitive) - - results = primitive.produce(inputs=container.List([0, 1], generate_metadata=True)).value - - self.assertEqual(len(results), 2) - - for dataset in results: - self.assertEqual(len(dataset), 4) - - self.assertEqual(results[0]['codes'].shape[0], 3) - self.assertEqual(results[1]['codes'].shape[0], 3) - - self.assertEqual(set(results[0]['codes'].iloc[:, 0]), {'AAA', 'BBB', 'CCC'}) - self.assertEqual(len(results[0]['learningData'].iloc[:, 0]), 9) - self.assertEqual(set(results[0]['learningData'].iloc[:, 1]), {'AAA', 'BBB', 'CCC'}) - self.assertEqual(set(results[0]['learningData'].iloc[:, 2]), {'bbb', 'ccc', 'ddd'}) - self.assertEqual(set(results[0]['learningData'].iloc[:, 3]), {'1990'}) - - self.assertEqual(set(results[1]['codes'].iloc[:, 0]), {'AAA', 'BBB', 'CCC'}) - self.assertEqual(len(results[1]['learningData'].iloc[:, 0]), 9) - self.assertEqual(set(results[1]['learningData'].iloc[:, 1]), {'AAA', 'BBB', 'CCC'}) - self.assertEqual(set(results[1]['learningData'].iloc[:, 2]), {'aaa', 'bbb', 'ddd', 'eee'}) - self.assertEqual(set(results[1]['learningData'].iloc[:, 3]), {'1990', '2000'}) - - def test_produce_score(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'tests', 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - # We fake that the dataset is time-series. 
- dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 3), 'https://metadata.datadrivendiscovery.org/types/Time') - - hyperparams_class = kfold_split_timeseries.KFoldTimeSeriesSplitPrimitive.metadata.get_hyperparams() - - folds = 5 - primitive = kfold_split_timeseries.KFoldTimeSeriesSplitPrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'number_of_folds': folds, - 'number_of_window_folds': 1, - })) - - primitive.set_training_data(dataset=dataset) - primitive.fit() - - results = primitive.produce_score_data(inputs=container.List([0, 1], generate_metadata=True)).value - - self.assertEqual(len(results), 2) - - for dataset in results: - self.assertEqual(len(dataset), 4) - - self.assertEqual(results[0]['codes'].shape[0], 3) - self.assertEqual(results[1]['codes'].shape[0], 3) - - self.assertEqual(set(results[0]['codes'].iloc[:, 0]), {'AAA', 'BBB', 'CCC'}) - self.assertEqual(set(results[0]['learningData'].iloc[:, 0]), {'2', '3', '32', '33', '37', '38', '39'}) - self.assertEqual(set(results[0]['learningData'].iloc[:, 1]), {'AAA', 'BBB', 'CCC'}) - self.assertEqual(set(results[0]['learningData'].iloc[:, 2]), {'aaa', 'ddd', 'eee'}) - self.assertEqual(set(results[0]['learningData'].iloc[:, 3]), {'1990', '2000'}) - - self.assertEqual(set(results[1]['codes'].iloc[:, 0]), {'AAA', 'BBB', 'CCC'}) - self.assertEqual(set(results[1]['learningData'].iloc[:, 0]), {'22', '23', '24', '31', '40', '41', '42'}) - self.assertEqual(set(results[1]['learningData'].iloc[:, 1]), {'AAA', 'BBB', 'CCC'}) - self.assertEqual(set(results[1]['learningData'].iloc[:, 2]), {'ccc', 'ddd', 'eee'}) - self.assertEqual(set(results[1]['learningData'].iloc[:, 3]), {'2000'}) - - def test_unsorted_datetimes_timeseries_4(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'tests', 'data', 'datasets', 'timeseries_dataset_4', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - hyperparams_class = kfold_split_timeseries.KFoldTimeSeriesSplitPrimitive.metadata.get_hyperparams() - - folds = 5 - primitive = kfold_split_timeseries.KFoldTimeSeriesSplitPrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'number_of_folds': folds, - 'number_of_window_folds': 1, - })) - - primitive.set_training_data(dataset=dataset) - primitive.fit() - - # To test that pickling works. 
- pickle.dumps(primitive) - - results = primitive.produce(inputs=container.List([0, 1], generate_metadata=True)).value - - self.assertEqual(len(results), 2) - - for dataset in results: - self.assertEqual(len(dataset), 1) - - self.assertEqual(len(results[0]['learningData'].iloc[:, 0]), 8) - self.assertEqual(set(results[0]['learningData'].iloc[:, 3]), {'2013-11-05', '2013-11-06', '2013-11-07', '2013-11-08', '2013-11-11', - '2013-11-12', '2013-11-13', '2013-11-14'}) - - self.assertEqual(len(results[1]['learningData'].iloc[:, 0]), 8) - self.assertEqual(set(results[1]['learningData'].iloc[:, 3]), {'2013-11-13', '2013-11-14', '2013-11-15', '2013-11-18', '2013-11-19', - '2013-11-20', '2013-11-21', '2013-11-22'}) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_lgbm_classifier.py b/common-primitives/tests/test_lgbm_classifier.py deleted file mode 100644 index 90d7d43..0000000 --- a/common-primitives/tests/test_lgbm_classifier.py +++ /dev/null @@ -1,571 +0,0 @@ -import os -import pickle -import unittest - -import numpy as np - -from d3m import container, utils -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe, extract_columns_semantic_types, lgbm_classifier, column_parser - - -def _add_categorical_col(attributes): - rand_str = ['a', 'b', 'c', 'd', 'e'] - attributes = attributes.append_columns(container.DataFrame(data={ - 'mock_cat_col': np.random.choice(rand_str, attributes.shape[0]) - }, generate_metadata=True)) - attributes.metadata = attributes.metadata.add_semantic_type([metadata_base.ALL_ELEMENTS, attributes.shape[-1] - 1], - 'https://metadata.datadrivendiscovery.org/types/CategoricalData') - attributes.metadata = attributes.metadata.add_semantic_type([metadata_base.ALL_ELEMENTS, attributes.shape[-1] - 1], - 'https://metadata.datadrivendiscovery.org/types/Attribute') - return attributes - - -def _get_iris(): - dataset_doc_path = os.path.abspath( - os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - hyperparams_class = \ - dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - - dataframe = primitive.produce(inputs=dataset).value - return dataframe - - -def _get_iris_columns(): - dataframe = _get_iris() - - # We set custom metadata on columns. - for column_index in range(1, 5): - dataframe.metadata = dataframe.metadata.update_column(column_index, {'custom_metadata': 'attributes'}) - for column_index in range(5, 6): - dataframe.metadata = dataframe.metadata.update_column(column_index, {'custom_metadata': 'targets'}) - - # We set semantic types like runtime would. - dataframe.metadata = dataframe.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 5), - 'https://metadata.datadrivendiscovery.org/types/Target') - dataframe.metadata = dataframe.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 5), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataframe.metadata = dataframe.metadata.remove_semantic_type((metadata_base.ALL_ELEMENTS, 5), - 'https://metadata.datadrivendiscovery.org/types/Attribute') - dataframe = _add_categorical_col(dataframe) - - # Parsing. 
- hyperparams_class = \ - column_parser.ColumnParserPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = column_parser.ColumnParserPrimitive(hyperparams=hyperparams_class.defaults()) - dataframe = primitive.produce(inputs=dataframe).value - - hyperparams_class = \ - extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive.metadata.query()['primitive_code'][ - 'class_type_arguments']['Hyperparams'] - - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive( - hyperparams=hyperparams_class.defaults().replace( - {'semantic_types': ('https://metadata.datadrivendiscovery.org/types/Attribute',)})) - attributes = primitive.produce(inputs=dataframe).value - - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive( - hyperparams=hyperparams_class.defaults().replace( - {'semantic_types': ('https://metadata.datadrivendiscovery.org/types/SuggestedTarget',)})) - targets = primitive.produce(inputs=dataframe).value - - return dataframe, attributes, targets - - -class LGBMTestCase(unittest.TestCase): - attributes: container.DataFrame = None - targets: container.DataFrame = None - dataframe: container.DataFrame = None - - @classmethod - def setUpClass(cls) -> None: - cls.dataframe, cls.attributes, cls.targets = _get_iris_columns() - cls.excp_attributes = cls.attributes.copy() - - def test_single_target(self): - self.assertEqual(list(self.targets.columns), ['species']) - - hyperparams_class = \ - lgbm_classifier.LightGBMClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = lgbm_classifier.LightGBMClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=self.attributes, outputs=self.targets) - primitive.fit() - - predictions = primitive.produce(inputs=self.attributes).value - - self.assertEqual(list(predictions.columns), ['species']) - - self.assertEqual(predictions.shape, (150, 1)) - self.assertEqual(predictions.iloc[0, 0], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(0)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(0)['custom_metadata'], 'targets') - - self._test_single_target_metadata(predictions.metadata) - - samples = primitive.sample(inputs=self.attributes).value - - self.assertEqual(list(samples[0].columns), ['species']) - - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, (150, 1)) - self.assertEqual(samples[0].iloc[0, 0], 'Iris-setosa') - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(0)['name'], 'species') - self.assertEqual(samples[0].metadata.query_column(0)['custom_metadata'], 'targets') - - log_likelihoods = primitive.log_likelihoods(inputs=self.attributes, outputs=self.targets).value - - 
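For context on the log_likelihoods/log_likelihood checks used below: per row, the value is the natural log of the probability the classifier assigns to the observed class, and the singular variant aggregates over rows into the (1, 1) frame the tests inspect. A small numpy sketch with invented probabilities, not the LightGBM code path:

import numpy as np

# Invented per-row class probabilities for a three-class problem.
probabilities = np.array([
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
])
observed = np.array([0, 1])  # index of the true class in each row

# Per-row log-likelihood of the observed class.
per_row = np.log(probabilities[np.arange(len(observed)), observed])

# Aggregate over rows, analogous to the scalar the tests compare against.
total = per_row.sum()
print(per_row, total)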
self.assertEqual(list(log_likelihoods.columns), ['species'])
-
-        self.assertEqual(log_likelihoods.shape, (150, 1))
-        self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species')
-
-        log_likelihood = primitive.log_likelihood(inputs=self.attributes, outputs=self.targets).value
-
-        self.assertEqual(list(log_likelihood.columns), ['species'])
-
-        self.assertEqual(log_likelihood.shape, (1, 1))
-        self.assertAlmostEqual(log_likelihood.iloc[0, 0], -6.338635478886032)
-        self.assertEqual(log_likelihood.metadata.query_column(0)['name'], 'species')
-
-    def test_single_target_continue_fit(self):
-        hyperparams_class = \
-            lgbm_classifier.LightGBMClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][
-                'Hyperparams']
-        primitive = lgbm_classifier.LightGBMClassifierPrimitive(
-            hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False}))
-
-        primitive.set_training_data(inputs=self.attributes, outputs=self.targets)
-        primitive.fit()
-        # reset the training data to make continue_fit() work.
-        primitive.set_training_data(inputs=self.attributes, outputs=self.targets)
-        primitive.continue_fit()
-        params = primitive.get_params()
-        self.assertEqual(params['booster'].current_iteration(),
-                         primitive.hyperparams['n_estimators'] + primitive.hyperparams['n_more_estimators'])
-        predictions = primitive.produce(inputs=self.attributes).value
-
-        self.assertEqual(predictions.shape, (150, 1))
-        self.assertEqual(predictions.iloc[0, 0], 'Iris-setosa')
-        self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0),
-                                                               'https://metadata.datadrivendiscovery.org/types/PredictedTarget'))
-        self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0),
-                                                                'https://metadata.datadrivendiscovery.org/types/TrueTarget'))
-        self.assertEqual(predictions.metadata.query_column(0)['name'], 'species')
-        self.assertEqual(predictions.metadata.query_column(0)['custom_metadata'], 'targets')
-
-        self._test_single_target_metadata(predictions.metadata)
-
-        samples = primitive.sample(inputs=self.attributes).value
-
-        self.assertEqual(len(samples), 1)
-        self.assertEqual(samples[0].shape, (150, 1))
-        self.assertEqual(samples[0].iloc[0, 0], 'Iris-setosa')
-        self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0),
-                                                              'https://metadata.datadrivendiscovery.org/types/PredictedTarget'))
-        self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0),
-                                                               'https://metadata.datadrivendiscovery.org/types/TrueTarget'))
-        self.assertEqual(samples[0].metadata.query_column(0)['name'], 'species')
-        self.assertEqual(samples[0].metadata.query_column(0)['custom_metadata'], 'targets')
-
-        log_likelihoods = primitive.log_likelihoods(inputs=self.attributes, outputs=self.targets).value
-
-        self.assertEqual(log_likelihoods.shape, (150, 1))
-        self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species')
-
-        log_likelihood = primitive.log_likelihood(inputs=self.attributes, outputs=self.targets).value
-
-        self.assertEqual(log_likelihood.shape, (1, 1))
-        self.assertAlmostEqual(log_likelihood.iloc[0, 0], -3.723258225143776)
-        self.assertEqual(log_likelihood.metadata.query_column(0)['name'], 'species')
-
-    def _test_single_target_metadata(self, predictions_metadata):
-        expected_metadata = [{
-            'selector': [],
-            'metadata': {
-                'schema': metadata_base.CONTAINER_SCHEMA_VERSION,
-                'structural_type': 'd3m.container.pandas.DataFrame',
-                'semantic_types':
['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 1, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'structural_type': 'str', - 'name': 'species', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'custom_metadata': 'targets', - }, - }] - - self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), expected_metadata) - - def test_semantic_types(self): - # dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - lgbm_classifier.LightGBMClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = lgbm_classifier.LightGBMClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=self.dataframe, outputs=self.dataframe) - primitive.fit() - - predictions = primitive.produce(inputs=self.dataframe).value - - self.assertEqual(list(predictions.columns), ['species']) - - self.assertEqual(predictions.shape, (150, 1)) - self.assertEqual(predictions.iloc[0, 0], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(0)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(0)['custom_metadata'], 'targets') - - samples = primitive.sample(inputs=self.dataframe).value - self.assertEqual(list(samples[0].columns), ['species']) - - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, (150, 1)) - self.assertEqual(samples[0].iloc[0, 0], 'Iris-setosa') - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(0)['name'], 'species') - self.assertEqual(samples[0].metadata.query_column(0)['custom_metadata'], 'targets') - - log_likelihoods = primitive.log_likelihoods(inputs=self.dataframe, outputs=self.dataframe).value - self.assertEqual(list(log_likelihoods.columns), ['species']) - - self.assertEqual(log_likelihoods.shape, (150, 1)) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - - log_likelihood = primitive.log_likelihood(inputs=self.dataframe, outputs=self.dataframe).value - self.assertEqual(list(log_likelihood.columns), ['species']) - - self.assertEqual(log_likelihood.shape, (1, 1)) - self.assertAlmostEqual(log_likelihood.iloc[0, 0], -6.338635478886032) - 
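-        # Same aggregate value as in test_single_target: selecting columns by
-        # semantic types trains the same model as passing attributes and targets directly.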
self.assertEqual(log_likelihood.metadata.query_column(0)['name'], 'species')
-
-        feature_importances = primitive.produce_feature_importances().value
-        self.assertEqual(list(feature_importances),
-                         ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth', 'mock_cat_col'])
-        self.assertEqual(feature_importances.metadata.query_column(0)['name'], 'sepalLength')
-        self.assertEqual(feature_importances.metadata.query_column(1)['name'], 'sepalWidth')
-        self.assertEqual(feature_importances.metadata.query_column(2)['name'], 'petalLength')
-        self.assertEqual(feature_importances.metadata.query_column(3)['name'], 'petalWidth')
-
-        self.assertEqual(feature_importances.values.tolist(),
-                         [[0.22740524781341107, 0.18513119533527697, 0.3323615160349854, 0.25510204081632654, 0.0]])
-
-    def test_return_append(self):
-        hyperparams_class = \
-            lgbm_classifier.LightGBMClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][
-                'Hyperparams']
-        primitive = lgbm_classifier.LightGBMClassifierPrimitive(hyperparams=hyperparams_class.defaults())
-
-        primitive.set_training_data(inputs=self.dataframe, outputs=self.dataframe)
-        primitive.fit()
-
-        predictions = primitive.produce(inputs=self.dataframe).value
-        self.assertEqual(list(predictions.columns), [
-            'd3mIndex',
-            'sepalLength',
-            'sepalWidth',
-            'petalLength',
-            'petalWidth',
-            'species',
-            'mock_cat_col',
-            'species',
-        ])
-        self.assertEqual(predictions.shape, (150, 8))
-        self.assertEqual(predictions.iloc[0, 7], 'Iris-setosa')
-        self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 7),
-                                                               'https://metadata.datadrivendiscovery.org/types/PredictedTarget'))
-        self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 7),
-                                                                'https://metadata.datadrivendiscovery.org/types/TrueTarget'))
-        self.assertEqual(predictions.metadata.query_column(7)['name'], 'species')
-        self.assertEqual(predictions.metadata.query_column(7)['custom_metadata'], 'targets')
-
-        self._test_return_append_metadata(predictions.metadata)
-
-    def _test_return_append_metadata(self, predictions_metadata):
-        self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), [{
-            'metadata': {'dimension': {'length': 150,
-                                       'name': 'rows',
-                                       'semantic_types': [
-                                           'https://metadata.datadrivendiscovery.org/types/TabularRow']},
-                         'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json',
-                         'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'],
-                         'structural_type': 'd3m.container.pandas.DataFrame'},
-             'selector': []},
-            {'metadata': {'dimension': {'length': 8,
-                                        'name': 'columns',
-                                        'semantic_types': [
-                                            'https://metadata.datadrivendiscovery.org/types/TabularColumn']}},
-             'selector': ['__ALL_ELEMENTS__']},
-            {'metadata': {'name': 'd3mIndex',
-                          'semantic_types': ['http://schema.org/Integer',
-                                             'https://metadata.datadrivendiscovery.org/types/PrimaryKey'],
-                          'structural_type': 'int'},
-             'selector': ['__ALL_ELEMENTS__', 0]},
-            {'metadata': {'custom_metadata': 'attributes',
-                          'name': 'sepalLength',
-                          'semantic_types': ['http://schema.org/Float',
-                                             'https://metadata.datadrivendiscovery.org/types/Attribute'],
-                          'structural_type': 'float'},
-             'selector': ['__ALL_ELEMENTS__', 1]},
-            {'metadata': {'custom_metadata': 'attributes',
-                          'name': 'sepalWidth',
-                          'semantic_types': ['http://schema.org/Float',
-                                             'https://metadata.datadrivendiscovery.org/types/Attribute'],
-                          'structural_type': 'float'},
-             'selector': ['__ALL_ELEMENTS__', 2]},
-            {'metadata': {'custom_metadata':
'attributes', - 'name': 'petalLength', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'float'}, - 'selector': ['__ALL_ELEMENTS__', 3]}, - {'metadata': {'custom_metadata': 'attributes', - 'name': 'petalWidth', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'float'}, - 'selector': ['__ALL_ELEMENTS__', 4]}, - {'metadata': {'custom_metadata': 'targets', - 'name': 'species', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/TrueTarget'], - 'structural_type': 'str'}, - 'selector': ['__ALL_ELEMENTS__', 5]}, - {'metadata': {'name': 'mock_cat_col', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'int'}, - 'selector': ['__ALL_ELEMENTS__', 6]}, - {'metadata': {'custom_metadata': 'targets', - 'name': 'species', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'structural_type': 'str'}, - 'selector': ['__ALL_ELEMENTS__', 7]}] - ) - - def test_return_new(self): - hyperparams_class = \ - lgbm_classifier.LightGBMClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = lgbm_classifier.LightGBMClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new'})) - - primitive.set_training_data(inputs=self.dataframe, outputs=self.dataframe) - primitive.fit() - - predictions = primitive.produce(inputs=self.dataframe).value - - self.assertEqual(list(predictions.columns), [ - 'd3mIndex', - 'species', - ]) - - self.assertEqual(predictions.shape, (150, 2)) - self.assertEqual(predictions.iloc[0, 1], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(1)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(1)['custom_metadata'], 'targets') - - self._test_return_new_metadata(predictions.metadata) - - def _test_return_new_metadata(self, predictions_metadata): - expected_metadata = [{ - 'selector': [], - 'metadata': { - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 2, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 
'd3mIndex', - 'structural_type': 'int', - 'semantic_types': ['http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'structural_type': 'str', - 'name': 'species', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'custom_metadata': 'targets', - }, - }] - - self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), expected_metadata) - - def test_return_replace(self): - hyperparams_class = \ - lgbm_classifier.LightGBMClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = lgbm_classifier.LightGBMClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'replace'})) - - primitive.set_training_data(inputs=self.dataframe, outputs=self.dataframe) - primitive.fit() - - predictions = primitive.produce(inputs=self.dataframe).value - self.assertEqual(list(predictions.columns), [ - 'd3mIndex', - 'species', - 'species', - ]) - self.assertEqual(predictions.shape, (150, 3)) - self.assertEqual(predictions.iloc[0, 1], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(1)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(1)['custom_metadata'], 'targets') - - self._test_return_replace_metadata(predictions.metadata) - - def test_pickle_unpickle(self): - hyperparams_class = \ - lgbm_classifier.LightGBMClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = lgbm_classifier.LightGBMClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=self.attributes, outputs=self.targets) - primitive.fit() - - before_pickled_prediction = primitive.produce(inputs=self.attributes).value - pickle_object = pickle.dumps(primitive) - primitive = pickle.loads(pickle_object) - after_unpickled_prediction = primitive.produce(inputs=self.attributes).value - self.assertTrue(container.DataFrame.equals(before_pickled_prediction, after_unpickled_prediction)) - - def _test_return_replace_metadata(self, predictions_metadata): - self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), [{ - 'selector': [], - 'metadata': { - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 3, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { 
- 'name': 'd3mIndex', - 'structural_type': 'int', - 'semantic_types': ['http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'structural_type': 'str', - 'name': 'species', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'custom_metadata': 'targets', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 2], - 'metadata': { - 'name': 'species', - 'structural_type': 'str', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/TrueTarget'], - 'custom_metadata': 'targets', - }, - }]) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_list_to_dataframe.py b/common-primitives/tests/test_list_to_dataframe.py deleted file mode 100644 index 0860981..0000000 --- a/common-primitives/tests/test_list_to_dataframe.py +++ /dev/null @@ -1,185 +0,0 @@ -import unittest - -import numpy - -from d3m import container, utils -from d3m.metadata import base as metadata_base - -from common_primitives import list_to_dataframe - - -class ListToDataFramePrimitiveTestCase(unittest.TestCase): - def test_basic(self): - data = container.List([container.List([1, 2, 3]), container.List([4, 5, 6])], generate_metadata=True) - - list_hyperparams_class = list_to_dataframe.ListToDataFramePrimitive.metadata.get_hyperparams() - list_primitive = list_to_dataframe.ListToDataFramePrimitive(hyperparams=list_hyperparams_class.defaults()) - dataframe = list_primitive.produce(inputs=data).value - - self._test_basic_metadata(dataframe.metadata, 'numpy.int64', True) - - def _test_basic_metadata(self, metadata, structural_type, add_individual_columns): - expected_metadata = [{ - 'selector': [], - 'metadata': { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.pandas.DataFrame', - 'dimension': { - 'length': 2, - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - }, - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'length': 3, - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - }, - 'structural_type': '__NO_VALUE__', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', '__ALL_ELEMENTS__'], - 'metadata': { - 'structural_type': 'int', - }, - }] - - if add_individual_columns: - expected_metadata.extend([{ - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'structural_type': structural_type, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'structural_type': structural_type, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 2], - 'metadata': { - 'structural_type': structural_type, - }, - }]) - - self.assertEqual(utils.to_json_structure(metadata.to_internal_simple_structure()), expected_metadata) - - def test_just_list(self): - data = container.List([1, 2, 3], generate_metadata=True) - - list_hyperparams_class = list_to_dataframe.ListToDataFramePrimitive.metadata.get_hyperparams() - list_primitive = 
list_to_dataframe.ListToDataFramePrimitive(hyperparams=list_hyperparams_class.defaults()) - dataframe = list_primitive.produce(inputs=data).value - - self._test_just_list_metadata(dataframe.metadata, 'numpy.int64', True) - - def _test_just_list_metadata(self, metadata, structural_type, use_individual_columns): - expected_metadata = [{ - 'selector': [], - 'metadata': { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.pandas.DataFrame', - 'dimension': { - 'length': 3, - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - }, - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'length': 1, - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - }, - 'structural_type': '__NO_VALUE__', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', '__ALL_ELEMENTS__'], - 'metadata': { - 'structural_type': structural_type, - }, - }] - - if use_individual_columns: - expected_metadata[-1]['selector'] = ['__ALL_ELEMENTS__', 0] - - self.assertEqual(utils.to_json_structure(metadata.to_internal_simple_structure()), expected_metadata) - - def test_list_ndarray(self): - data = container.List([container.ndarray(numpy.array([1, 2, 3])), container.ndarray(numpy.array([4, 5, 6]))], generate_metadata=True) - - list_hyperparams_class = list_to_dataframe.ListToDataFramePrimitive.metadata.get_hyperparams() - list_primitive = list_to_dataframe.ListToDataFramePrimitive(hyperparams=list_hyperparams_class.defaults()) - dataframe = list_primitive.produce(inputs=data).value - - self._test_list_ndarray_metadata(dataframe.metadata, True) - - def _test_list_ndarray_metadata(self, metadata, add_individual_columns): - expected_metadata = [{ - 'selector': [], - 'metadata': { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.pandas.DataFrame', - 'dimension': { - 'length': 2, - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - }, - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'length': 3, - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - }, - 'structural_type': '__NO_VALUE__', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', '__ALL_ELEMENTS__'], - 'metadata': { - 'structural_type': 'numpy.int64', - }, - }] - - if add_individual_columns: - expected_metadata.extend([{ - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'structural_type': 'numpy.int64', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'structural_type': 'numpy.int64', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 2], - 'metadata': { - 'structural_type': 'numpy.int64', - }, - }]) - - self.assertEqual(utils.to_json_structure(metadata.to_internal_simple_structure()), expected_metadata) - - def test_list_deeper_ndarray(self): - data = container.List([container.ndarray(numpy.array([[1, 2, 3], [11, 12, 13]])), container.ndarray(numpy.array([[4, 5, 6], [14, 15, 16]]))], generate_metadata=True) - - list_hyperparams_class = list_to_dataframe.ListToDataFramePrimitive.metadata.get_hyperparams() - list_primitive = list_to_dataframe.ListToDataFramePrimitive(hyperparams=list_hyperparams_class.defaults()) - - with self.assertRaisesRegex(ValueError, 'Must pass 2-d 
input'):
-            list_primitive.produce(inputs=data).value
-
-
-if __name__ == '__main__':
-    unittest.main()
diff --git a/common-primitives/tests/test_list_to_ndarray.py b/common-primitives/tests/test_list_to_ndarray.py
deleted file mode 100644
index 07d6d23..0000000
--- a/common-primitives/tests/test_list_to_ndarray.py
+++ /dev/null
@@ -1,132 +0,0 @@
-import unittest
-
-import numpy
-
-from d3m import container, utils
-from d3m.metadata import base as metadata_base
-
-from common_primitives import list_to_ndarray
-
-
-class ListToNDArrayPrimitiveTestCase(unittest.TestCase):
-    def test_basic(self):
-        data = container.List([container.List([1, 2, 3]), container.List([4, 5, 6])], generate_metadata=True)
-
-        list_hyperparams_class = list_to_ndarray.ListToNDArrayPrimitive.metadata.get_hyperparams()
-        list_primitive = list_to_ndarray.ListToNDArrayPrimitive(hyperparams=list_hyperparams_class.defaults())
-        array = list_primitive.produce(inputs=data).value
-
-        self._test_basic_metadata(array.metadata, 'numpy.int64')
-
-    def _test_basic_metadata(self, metadata, structural_type):
-        self.maxDiff = None
-
-        self.assertEqual(utils.to_json_structure(metadata.to_internal_simple_structure()), [{
-            'selector': [],
-            'metadata': {
-                'schema': metadata_base.CONTAINER_SCHEMA_VERSION,
-                'structural_type': 'd3m.container.numpy.ndarray',
-                'dimension': {
-                    'length': 2,
-                    'name': 'rows',
-                    'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'],
-                },
-                'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'],
-            },
-        }, {
-            'selector': ['__ALL_ELEMENTS__'],
-            'metadata': {
-                'dimension': {
-                    'length': 3,
-                    'name': 'columns',
-                    'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'],
-                },
-                'structural_type': '__NO_VALUE__',
-            },
-        }, {
-            'selector': ['__ALL_ELEMENTS__', '__ALL_ELEMENTS__'],
-            'metadata': {
-                'structural_type': structural_type,
-            },
-        }])
-
-    def test_just_list(self):
-        data = container.List([1, 2, 3], generate_metadata=True)
-
-        list_hyperparams_class = list_to_ndarray.ListToNDArrayPrimitive.metadata.get_hyperparams()
-        list_primitive = list_to_ndarray.ListToNDArrayPrimitive(hyperparams=list_hyperparams_class.defaults())
-        array = list_primitive.produce(inputs=data).value
-
-        self._test_just_list_metadata(array.metadata, 'numpy.int64')
-
-    def _test_just_list_metadata(self, metadata, structural_type):
-        self.assertEqual(utils.to_json_structure(metadata.to_internal_simple_structure()), [{
-            'selector': [],
-            'metadata': {
-                'schema': metadata_base.CONTAINER_SCHEMA_VERSION,
-                'structural_type': 'd3m.container.numpy.ndarray',
-                'dimension': {
-                    'length': 3,
-                },
-            },
-        }, {
-            'selector': ['__ALL_ELEMENTS__'],
-            'metadata': {
-                'structural_type': structural_type,
-            },
-        }])
-
-    def test_list_ndarray(self):
-        data = container.List([container.ndarray(numpy.array([[1, 2, 3], [11, 12, 13]])), container.ndarray(numpy.array([[4, 5, 6], [14, 15, 16]]))], generate_metadata=True)
-
-        list_hyperparams_class = list_to_ndarray.ListToNDArrayPrimitive.metadata.get_hyperparams()
-        list_primitive = list_to_ndarray.ListToNDArrayPrimitive(hyperparams=list_hyperparams_class.defaults())
-        array = list_primitive.produce(inputs=data).value
-
-        self._test_list_ndarray_metadata(array.metadata)
-
-    def _test_list_ndarray_metadata(self, metadata):
-        self.maxDiff = None
-
-        self.assertEqual(utils.to_json_structure(metadata.to_internal_simple_structure()), [{
-            'selector': [],
-            'metadata': {
-                'schema': metadata_base.CONTAINER_SCHEMA_VERSION,
-                'structural_type':
'd3m.container.numpy.ndarray',
-                'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'],
-                'dimension': {
-                    'length': 2,
-                    'name': 'rows',
-                    'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'],
-                },
-            },
-        }, {
-            'selector': ['__ALL_ELEMENTS__'],
-            'metadata': {
-                'semantic_types': '__NO_VALUE__',
-                'dimension': {
-                    'length': 2,
-                    'name': 'columns',
-                    'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'],
-                },
-                'structural_type': '__NO_VALUE__',
-            },
-        }, {
-            'selector': ['__ALL_ELEMENTS__', '__ALL_ELEMENTS__'],
-            'metadata': {
-                'dimension': {
-                    'length': 3,
-                    'semantic_types': '__NO_VALUE__',
-                    'name': '__NO_VALUE__',
-                },
-            },
-        }, {
-            'selector': ['__ALL_ELEMENTS__', '__ALL_ELEMENTS__', '__ALL_ELEMENTS__'],
-            'metadata': {
-                'structural_type': 'numpy.int64',
-            },
-        }])
-
-
-if __name__ == '__main__':
-    unittest.main()
diff --git a/common-primitives/tests/test_ndarray_to_dataframe.py b/common-primitives/tests/test_ndarray_to_dataframe.py
deleted file mode 100644
index d2987e2..0000000
--- a/common-primitives/tests/test_ndarray_to_dataframe.py
+++ /dev/null
@@ -1,99 +0,0 @@
-import unittest
-
-import numpy
-
-from d3m import container, utils
-from d3m.metadata import base as metadata_base
-
-from common_primitives import dataframe_to_ndarray, dataset_to_dataframe, ndarray_to_dataframe
-
-import utils as test_utils
-
-
-class NDArrayToDataFramePrimitiveTestCase(unittest.TestCase):
-    def test_basic(self):
-        # TODO: Find a less cumbersome way to get a numpy array loaded with a dataset
-        # load the iris dataset
-        dataset = test_utils.load_iris_metadata()
-
-        # convert the dataset into a dataframe
-        dataset_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams()
-        dataset_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataset_hyperparams_class.defaults())
-        dataframe_dataset = dataset_primitive.produce(inputs=dataset).value
-
-        # convert the dataframe into a numpy array
-        numpy_hyperparams_class = dataframe_to_ndarray.DataFrameToNDArrayPrimitive.metadata.get_hyperparams()
-        numpy_primitive = dataframe_to_ndarray.DataFrameToNDArrayPrimitive(hyperparams=numpy_hyperparams_class.defaults())
-        numpy_array = numpy_primitive.produce(inputs=dataframe_dataset).value
-
-        # convert the numpy array back into a dataframe
-        dataframe_hyperparams_class = ndarray_to_dataframe.NDArrayToDataFramePrimitive.metadata.get_hyperparams()
-        dataframe_primitive = ndarray_to_dataframe.NDArrayToDataFramePrimitive(hyperparams=dataframe_hyperparams_class.defaults())
-        dataframe = dataframe_primitive.produce(inputs=numpy_array).value
-
-        self.assertIsInstance(dataframe, container.DataFrame)
-
-        # verify dimensions
-        self.assertEqual(len(dataframe), 150)
-        self.assertEqual(len(dataframe.iloc[0]), 6)
-
-        # ensure column names added to dataframe
-        self.assertListEqual(list(dataframe.columns.values), ['d3mIndex', 'sepalLength', 'sepalWidth', 'petalLength', 'petalWidth', 'species'])
-
-        # verify data type is unchanged: every cell should still be a string
-        # (iterating a DataFrame directly yields column labels, not rows, so
-        # iterate over columns explicitly)
-        for column_name in dataframe.columns:
-            for cell in dataframe[column_name]:
-                self.assertIsInstance(cell, str)
-
-        # validate metadata
-        test_utils.test_iris_metadata(self, dataframe.metadata, 'd3m.container.pandas.DataFrame')
-
-    def test_vector(self):
-        data = container.ndarray(numpy.array([1, 2, 3]), generate_metadata=True)
-
-        dataframe_hyperparams_class = ndarray_to_dataframe.NDArrayToDataFramePrimitive.metadata.get_hyperparams()
-        dataframe_primitive =
ndarray_to_dataframe.NDArrayToDataFramePrimitive(hyperparams=dataframe_hyperparams_class.defaults()) - dataframe = dataframe_primitive.produce(inputs=data).value - - self._test_vector_metadata(dataframe.metadata, True) - - def _test_vector_metadata(self, metadata, use_individual_columns): - self.maxDiff = None - - expected_metadata = [{ - 'selector': [], - 'metadata': { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.pandas.DataFrame', - 'dimension': { - 'length': 3, - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - }, - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'length': 1, - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - }, - 'structural_type': '__NO_VALUE__', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', '__ALL_ELEMENTS__'], - 'metadata': { - 'structural_type': 'numpy.int64', - }, - }] - - if use_individual_columns: - expected_metadata[-1]['selector'] = ['__ALL_ELEMENTS__', 0] - - self.assertEqual(utils.to_json_structure(metadata.to_internal_simple_structure()), expected_metadata) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_ndarray_to_list.py b/common-primitives/tests/test_ndarray_to_list.py deleted file mode 100644 index b2c6555..0000000 --- a/common-primitives/tests/test_ndarray_to_list.py +++ /dev/null @@ -1,116 +0,0 @@ -import unittest - -import numpy - -from d3m import container, utils -from d3m.metadata import base as metadata_base - -from common_primitives import dataframe_to_ndarray, dataset_to_dataframe, ndarray_to_list - -import utils as test_utils - - -class NDArrayToListPrimitiveTestCase(unittest.TestCase): - def test_basic(self): - # TODO: Find a less cumbersome way to get a numpy array loaded with a dataset - # load the iris dataset - dataset = test_utils.load_iris_metadata() - - # convert the dataset into a dataframe - dataset_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - dataset_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataset_hyperparams_class.defaults()) - dataframe_dataset = dataset_primitive.produce(inputs=dataset).value - - # convert the dataframe into a numpy array - numpy_hyperparams_class = dataframe_to_ndarray.DataFrameToNDArrayPrimitive.metadata.get_hyperparams() - numpy_primitive = dataframe_to_ndarray.DataFrameToNDArrayPrimitive(hyperparams=numpy_hyperparams_class.defaults()) - numpy_array = numpy_primitive.produce(inputs=dataframe_dataset).value - - list_hyperparams_class = ndarray_to_list.NDArrayToListPrimitive.metadata.get_hyperparams() - list_primitive = ndarray_to_list.NDArrayToListPrimitive(hyperparams=list_hyperparams_class.defaults()) - list_value = list_primitive.produce(inputs=numpy_array).value - - self.assertIsInstance(list_value, container.List) - - # verify dimensions - self.assertEqual(len(list_value), 150) - self.assertEqual(len(list_value[0]), 6) - - # validate metadata - test_utils.test_iris_metadata(self, list_value.metadata, 'd3m.container.list.List', 'd3m.container.numpy.ndarray') - - def test_vector(self): - data = container.ndarray(numpy.array([1, 2, 3]), generate_metadata=True) - - list_hyperparams_class = ndarray_to_list.NDArrayToListPrimitive.metadata.get_hyperparams() - list_primitive = 
ndarray_to_list.NDArrayToListPrimitive(hyperparams=list_hyperparams_class.defaults()) - list_value = list_primitive.produce(inputs=data).value - - self._test_vector_metadata(list_value.metadata) - - def _test_vector_metadata(self, metadata): - self.assertEqual(utils.to_json_structure(metadata.to_internal_simple_structure()), [{ - 'selector': [], - 'metadata': { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.list.List', - 'dimension': { - 'length': 3, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'structural_type': 'numpy.int64', - }, - }]) - - def test_deep_array(self): - data = container.ndarray(numpy.array(range(2 * 3 * 4)).reshape((2, 3, 4)), generate_metadata=True) - - list_hyperparams_class = ndarray_to_list.NDArrayToListPrimitive.metadata.get_hyperparams() - list_primitive = ndarray_to_list.NDArrayToListPrimitive(hyperparams=list_hyperparams_class.defaults()) - list_value = list_primitive.produce(inputs=data).value - - self._test_deep_vector_metadata(list_value.metadata) - - def _test_deep_vector_metadata(self, metadata): - self.assertEqual(utils.to_json_structure(metadata.to_internal_simple_structure()), [{ - 'selector': [], - 'metadata': { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.list.List', - 'dimension': { - 'length': 2, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'length': 3, - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - }, - 'structural_type': 'd3m.container.numpy.ndarray', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__', '__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'length': 4, - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', '__ALL_ELEMENTS__', '__ALL_ELEMENTS__'], - 'metadata': { - 'structural_type': 'numpy.int64', - }, - }]) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_no_split.py b/common-primitives/tests/test_no_split.py deleted file mode 100644 index f61f476..0000000 --- a/common-primitives/tests/test_no_split.py +++ /dev/null @@ -1,71 +0,0 @@ -import os -import pickle -import unittest - -from d3m import container -from d3m.metadata import base as metadata_base - -from common_primitives import no_split - - -class NoSplitDatasetSplitPrimitiveTestCase(unittest.TestCase): - def test_produce_train(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'tests', 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. 
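-        # (The runtime marks the problem's target column as Target and TrueTarget and
-        # removes its Attribute type before handing the dataset to primitives.)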
- dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - hyperparams_class = no_split.NoSplitDatasetSplitPrimitive.metadata.get_hyperparams() - - primitive = no_split.NoSplitDatasetSplitPrimitive(hyperparams=hyperparams_class.defaults()) - - primitive.set_training_data(dataset=dataset) - primitive.fit() - - # To test that pickling works. - pickle.dumps(primitive) - - results = primitive.produce(inputs=container.List([0], generate_metadata=True)).value - - self.assertEqual(len(results), 1) - - for dataset in results: - self.assertEqual(len(dataset), 1) - - self.assertEqual(results[0]['learningData'].shape[0], 150) - self.assertEqual(list(results[0]['learningData'].iloc[:, 0]), [str(i) for i in range(150)]) - - def test_produce_score(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'tests', 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - hyperparams_class = no_split.NoSplitDatasetSplitPrimitive.metadata.get_hyperparams() - - primitive = no_split.NoSplitDatasetSplitPrimitive(hyperparams=hyperparams_class.defaults()) - - primitive.set_training_data(dataset=dataset) - primitive.fit() - - results = primitive.produce_score_data(inputs=container.List([0], generate_metadata=True)).value - - self.assertEqual(len(results), 1) - - for dataset in results: - self.assertEqual(len(dataset), 1) - - self.assertEqual(results[0]['learningData'].shape[0], 150) - self.assertEqual(list(results[0]['learningData'].iloc[:, 0]), [str(i) for i in range(150)]) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_normalize_column_references.py b/common-primitives/tests/test_normalize_column_references.py deleted file mode 100644 index 363ecb0..0000000 --- a/common-primitives/tests/test_normalize_column_references.py +++ /dev/null @@ -1,597 +0,0 @@ -import os -import unittest - -from d3m import container, utils - -from common_primitives import normalize_column_references - -import utils as test_utils - - -class NormalizeColumnReferencesPrimitiveTestCase(unittest.TestCase): - def test_basic(self): - dataset_doc_path = os.path.abspath( - os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json') - ) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - metadata_before = dataset.metadata.to_internal_json_structure() - - 
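-        # metadata_before is compared again at the end of this test to verify that
-        # produce() does not modify the metadata of its input dataset.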
self._test_metadata_before(utils.to_json_structure(dataset.metadata.to_internal_simple_structure()), dataset_doc_path) - - hyperparams_class = normalize_column_references.NormalizeColumnReferencesPrimitive.metadata.get_hyperparams() - - primitive = normalize_column_references.NormalizeColumnReferencesPrimitive( - hyperparams=hyperparams_class.defaults() - ) - - normalized_dataset = primitive.produce(inputs=dataset).value - - self.assertIsInstance(normalized_dataset, container.Dataset) - - self._test_metadata_after(utils.to_json_structure(normalized_dataset.metadata.to_internal_simple_structure()), dataset_doc_path) - - self.assertEqual(metadata_before, dataset.metadata.to_internal_json_structure()) - - def _test_metadata_before(self, metadata, dataset_doc_path): - self.maxDiff = None - - self.assertEqual( - test_utils.convert_through_json(metadata), - [ - { - 'selector': [], - 'metadata': { - 'description': 'A synthetic dataset trying to be similar to a database dump, with tables with different relations between them.', - 'digest': '68c435c6ba9a1c419c79507275c0d5710786dfe481e48f35591d87a7dbf5bb1a', - 'dimension': { - 'length': 4, - 'name': 'resources', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/DatasetResource'], - }, - 'id': 'database_dataset_1', - 'location_uris': [ - 'file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path), - ], - 'name': 'A dataset simulating a database dump', - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - 'source': {'license': 'CC', 'redacted': False}, - 'structural_type': 'd3m.container.dataset.Dataset', - 'version': '4.0.0', - }, - }, - { - 'selector': ['authors'], - 'metadata': { - 'dimension': { - 'length': 3, - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - }, - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'structural_type': 'd3m.container.pandas.DataFrame', - }, - }, - { - 'selector': ['authors', '__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'length': 2, - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - } - }, - }, - { - 'selector': ['authors', '__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'id', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['authors', '__ALL_ELEMENTS__', 1], - 'metadata': { - 'name': 'name', - 'semantic_types': [ - 'http://schema.org/Text', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['codes'], - 'metadata': { - 'dimension': { - 'length': 3, - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - }, - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'structural_type': 'd3m.container.pandas.DataFrame', - }, - }, - { - 'selector': ['codes', '__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'length': 3, - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - } - }, - }, - { - 'selector': ['codes', '__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'code', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['codes', 
'__ALL_ELEMENTS__', 1], - 'metadata': { - 'name': 'name', - 'semantic_types': [ - 'http://schema.org/Text', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['codes', '__ALL_ELEMENTS__', 2], - 'metadata': { - 'foreign_key': {'column_index': 0, 'resource_id': 'authors', 'type': 'COLUMN'}, - 'name': 'author', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['learningData'], - 'metadata': { - 'dimension': { - 'length': 45, - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - }, - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/Table', - 'https://metadata.datadrivendiscovery.org/types/DatasetEntryPoint', - ], - 'structural_type': 'd3m.container.pandas.DataFrame', - }, - }, - { - 'selector': ['learningData', '__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'length': 5, - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - } - }, - }, - { - 'selector': ['learningData', '__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'd3mIndex', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['learningData', '__ALL_ELEMENTS__', 1], - 'metadata': { - 'foreign_key': {'column_name': 'code', 'resource_id': 'codes', 'type': 'COLUMN'}, - 'name': 'code', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['learningData', '__ALL_ELEMENTS__', 2], - 'metadata': { - 'name': 'key', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['learningData', '__ALL_ELEMENTS__', 3], - 'metadata': { - 'name': 'year', - 'semantic_types': [ - 'http://schema.org/DateTime', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['learningData', '__ALL_ELEMENTS__', 4], - 'metadata': { - 'name': 'value', - 'semantic_types': [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['values'], - 'metadata': { - 'dimension': { - 'length': 64, - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - }, - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'structural_type': 'd3m.container.pandas.DataFrame', - }, - }, - { - 'selector': ['values', '__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'length': 4, - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - } - }, - }, - { - 'selector': ['values', '__ALL_ELEMENTS__', 0], - 'metadata': { - 'foreign_key': {'column_name': 'code', 'resource_id': 'codes', 'type': 'COLUMN'}, - 'name': 'code', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 
'structural_type': 'str', - }, - }, - { - 'selector': ['values', '__ALL_ELEMENTS__', 1], - 'metadata': { - 'name': 'key', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['values', '__ALL_ELEMENTS__', 2], - 'metadata': { - 'name': 'year', - 'semantic_types': [ - 'http://schema.org/DateTime', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['values', '__ALL_ELEMENTS__', 3], - 'metadata': { - 'name': 'value', - 'semantic_types': [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - ], - ) - - def _test_metadata_after(self, metadata, dataset_doc_path): - self.maxDiff = None - - self.assertEqual( - test_utils.convert_through_json(metadata), - [ - { - 'selector': [], - 'metadata': { - 'description': 'A synthetic dataset trying to be similar to a database dump, with tables with different relations between them.', - 'digest': '68c435c6ba9a1c419c79507275c0d5710786dfe481e48f35591d87a7dbf5bb1a', - 'dimension': { - 'length': 4, - 'name': 'resources', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/DatasetResource'], - }, - 'id': 'database_dataset_1', - 'location_uris': [ - 'file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path), - ], - 'name': 'A dataset simulating a database dump', - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - 'source': {'license': 'CC', 'redacted': False}, - 'structural_type': 'd3m.container.dataset.Dataset', - 'version': '4.0.0', - }, - }, - { - 'selector': ['authors'], - 'metadata': { - 'dimension': { - 'length': 3, - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - }, - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'structural_type': 'd3m.container.pandas.DataFrame', - }, - }, - { - 'selector': ['authors', '__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'length': 2, - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - } - }, - }, - { - 'selector': ['authors', '__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'id', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['authors', '__ALL_ELEMENTS__', 1], - 'metadata': { - 'name': 'name', - 'semantic_types': [ - 'http://schema.org/Text', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['codes'], - 'metadata': { - 'dimension': { - 'length': 3, - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - }, - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'structural_type': 'd3m.container.pandas.DataFrame', - }, - }, - { - 'selector': ['codes', '__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'length': 3, - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - } - }, - }, - { - 'selector': ['codes', '__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'code', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 
'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['codes', '__ALL_ELEMENTS__', 1], - 'metadata': { - 'name': 'name', - 'semantic_types': [ - 'http://schema.org/Text', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['codes', '__ALL_ELEMENTS__', 2], - 'metadata': { - 'foreign_key': {'column_index': 0, 'resource_id': 'authors', 'type': 'COLUMN'}, - 'name': 'author', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['learningData'], - 'metadata': { - 'dimension': { - 'length': 45, - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - }, - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/Table', - 'https://metadata.datadrivendiscovery.org/types/DatasetEntryPoint', - ], - 'structural_type': 'd3m.container.pandas.DataFrame', - }, - }, - { - 'selector': ['learningData', '__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'length': 5, - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - } - }, - }, - { - 'selector': ['learningData', '__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'd3mIndex', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['learningData', '__ALL_ELEMENTS__', 1], - 'metadata': { - 'foreign_key': {'column_index': 0, 'column_name': '__NO_VALUE__', 'resource_id': 'codes', 'type': 'COLUMN'}, - 'name': 'code', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['learningData', '__ALL_ELEMENTS__', 2], - 'metadata': { - 'name': 'key', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['learningData', '__ALL_ELEMENTS__', 3], - 'metadata': { - 'name': 'year', - 'semantic_types': [ - 'http://schema.org/DateTime', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['learningData', '__ALL_ELEMENTS__', 4], - 'metadata': { - 'name': 'value', - 'semantic_types': [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['values'], - 'metadata': { - 'dimension': { - 'length': 64, - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - }, - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'structural_type': 'd3m.container.pandas.DataFrame', - }, - }, - { - 'selector': ['values', '__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'length': 4, - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - } - }, - }, - { - 'selector': ['values', '__ALL_ELEMENTS__', 0], - 'metadata': { - 'foreign_key': {'column_index': 0, 'column_name': '__NO_VALUE__', 'resource_id': 'codes', 'type': 'COLUMN'}, - 
'name': 'code', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['values', '__ALL_ELEMENTS__', 1], - 'metadata': { - 'name': 'key', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['values', '__ALL_ELEMENTS__', 2], - 'metadata': { - 'name': 'year', - 'semantic_types': [ - 'http://schema.org/DateTime', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - { - 'selector': ['values', '__ALL_ELEMENTS__', 3], - 'metadata': { - 'name': 'value', - 'semantic_types': [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - 'structural_type': 'str', - }, - }, - ], - ) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_normalize_graphs.py b/common-primitives/tests/test_normalize_graphs.py deleted file mode 100644 index e6eeb8d..0000000 --- a/common-primitives/tests/test_normalize_graphs.py +++ /dev/null @@ -1,207 +0,0 @@ -import os -import unittest - -from d3m import container, utils -from d3m.metadata import base as metadata_base - -from common_primitives import normalize_graphs, denormalize, dataset_map, column_parser, normalize_column_references, simple_profiler - -import utils as test_utils - - -class NormalizeGraphsPrimitiveTestCase(unittest.TestCase): - def _parse_columns(self, dataset): - hyperparams_class = dataset_map.DataFrameDatasetMapPrimitive.metadata.get_hyperparams() - - primitive = dataset_map.DataFrameDatasetMapPrimitive( - # We have to make an instance of the primitive ourselves. - hyperparams=hyperparams_class.defaults().replace({ - 'primitive': column_parser.ColumnParserPrimitive( - hyperparams=column_parser.ColumnParserPrimitive.metadata.get_hyperparams().defaults(), - ), - 'resources': 'all', - }), - - ) - - return primitive.produce(inputs=dataset).value - - def _normalize_column_references(self, dataset): - hyperparams_class = normalize_column_references.NormalizeColumnReferencesPrimitive.metadata.get_hyperparams() - - primitive = normalize_column_references.NormalizeColumnReferencesPrimitive( - hyperparams=hyperparams_class.defaults(), - ) - - return primitive.produce(inputs=dataset).value - - def _get_dataset_1(self): - dataset_doc_path = os.path.abspath( - os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'graph_dataset_1', 'datasetDoc.json') - ) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. 
- dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 2), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 2), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 2), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - metadata_before = dataset.metadata.to_internal_json_structure() - - normalized_dataset = self._normalize_column_references(dataset) - - hyperparams_class = normalize_graphs.NormalizeGraphsPrimitive.metadata.get_hyperparams() - - primitive = normalize_graphs.NormalizeGraphsPrimitive( - hyperparams=hyperparams_class.defaults(), - ) - - normalized_dataset = primitive.produce(inputs=normalized_dataset).value - - hyperparams_class = dataset_map.DataFrameDatasetMapPrimitive.metadata.get_hyperparams() - - primitive = dataset_map.DataFrameDatasetMapPrimitive( - # We have to make an instance of the primitive ourselves. - hyperparams=hyperparams_class.defaults().replace({ - 'primitive': simple_profiler.SimpleProfilerPrimitive( - hyperparams=simple_profiler.SimpleProfilerPrimitive.metadata.get_hyperparams().defaults().replace({ - 'detect_semantic_types': [ - 'http://schema.org/Boolean', 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'http://schema.org/Integer', 'http://schema.org/Float', 'http://schema.org/Text', - 'https://metadata.datadrivendiscovery.org/types/FloatVector', 'http://schema.org/DateTime', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - 'https://metadata.datadrivendiscovery.org/types/Time', - 'https://metadata.datadrivendiscovery.org/types/UnknownType', - ], - }), - ), - 'resources': 'all', - }), - - ) - - primitive.set_training_data(inputs=normalized_dataset) - primitive.fit() - normalized_dataset = primitive.produce(inputs=normalized_dataset).value - - normalized_dataset = self._parse_columns(normalized_dataset) - - hyperparams_class = denormalize.DenormalizePrimitive.metadata.get_hyperparams() - - primitive = denormalize.DenormalizePrimitive( - hyperparams=hyperparams_class.defaults(), - ) - - normalized_dataset = primitive.produce(inputs=normalized_dataset).value - - # To make metadata match in recorded structural types. - normalized_dataset.metadata = normalized_dataset.metadata.generate(normalized_dataset) - - self.assertEqual(metadata_before, dataset.metadata.to_internal_json_structure()) - - return normalized_dataset - - def _get_dataset_2(self): - dataset_doc_path = os.path.abspath( - os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'graph_dataset_2', 'datasetDoc.json') - ) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. 
- dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 4), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - metadata_before = dataset.metadata.to_internal_json_structure() - - normalized_dataset = self._normalize_column_references(dataset) - - hyperparams_class = normalize_graphs.NormalizeGraphsPrimitive.metadata.get_hyperparams() - - primitive = normalize_graphs.NormalizeGraphsPrimitive( - hyperparams=hyperparams_class.defaults(), - ) - - normalized_dataset = primitive.produce(inputs=normalized_dataset).value - - normalized_dataset = self._parse_columns(normalized_dataset) - - # To make metadata match in recorded structural types. - normalized_dataset.metadata = normalized_dataset.metadata.generate(normalized_dataset) - - self.assertEqual(metadata_before, dataset.metadata.to_internal_json_structure()) - - return normalized_dataset - - def test_basic(self): - self.maxDiff = None - - dataset_1 = self._get_dataset_1() - dataset_2 = self._get_dataset_2() - - # Making some changes to make resulting datasets the same. - dataset_2['G1_edges'] = dataset_2['edgeList'] - del dataset_2['edgeList'] - - dataset_2.metadata = dataset_2.metadata.copy_to(dataset_2.metadata, ('edgeList',), ('G1_edges',)) - dataset_2.metadata = dataset_2.metadata.remove(('edgeList',), recursive=True) - - for field in ['description', 'digest', 'id', 'location_uris', 'name']: - dataset_1.metadata = dataset_1.metadata.update((), {field: metadata_base.NO_VALUE}) - dataset_2.metadata = dataset_2.metadata.update((), {field: metadata_base.NO_VALUE}) - - dataset_1_metadata = test_utils.effective_metadata(dataset_1.metadata) - dataset_2_metadata = test_utils.effective_metadata(dataset_2.metadata) - - # Removing an ALL_ELEMENTS selector which does not really apply to any element anymore - # (it is overridden by more specific selectors). 
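# (test_utils.effective_metadata presumably flattens the metadata into a plain
# list of selector/metadata entries, which is why an entry can be deleted by
# index below: entry 3 is the stale ALL_ELEMENTS entry left over in dataset_1
# that, as noted above, no longer applies to any element and has no
# counterpart in dataset_2.)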
- del dataset_1_metadata[3] - - self.assertEqual(dataset_1_metadata, dataset_2_metadata) - - self.assertCountEqual(dataset_1.keys(), dataset_2.keys()) - - for resource_id in dataset_1.keys(): - self.assertTrue(dataset_1[resource_id].equals(dataset_2[resource_id]), resource_id) - - def test_idempotent_dataset_1(self): - dataset = self._get_dataset_1() - - hyperparams_class = normalize_graphs.NormalizeGraphsPrimitive.metadata.get_hyperparams() - - primitive = normalize_graphs.NormalizeGraphsPrimitive( - hyperparams=hyperparams_class.defaults(), - ) - - normalized_dataset = primitive.produce(inputs=dataset).value - - self.assertEqual(utils.to_json_structure(dataset.metadata.to_internal_simple_structure()), normalized_dataset.metadata.to_internal_json_structure()) - - self.assertCountEqual(dataset.keys(), normalized_dataset.keys()) - - for resource_id in dataset.keys(): - self.assertTrue(dataset[resource_id].equals(normalized_dataset[resource_id]), resource_id) - - def test_idempotent_dataset_2(self): - dataset = self._get_dataset_2() - - hyperparams_class = normalize_graphs.NormalizeGraphsPrimitive.metadata.get_hyperparams() - - primitive = normalize_graphs.NormalizeGraphsPrimitive( - hyperparams=hyperparams_class.defaults(), - ) - - normalized_dataset = primitive.produce(inputs=dataset).value - - self.assertEqual(dataset.metadata.to_internal_json_structure(), normalized_dataset.metadata.to_internal_json_structure()) - - self.assertCountEqual(dataset.keys(), normalized_dataset.keys()) - - for resource_id in dataset.keys(): - self.assertTrue(dataset[resource_id].equals(normalized_dataset[resource_id]), resource_id) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_numeric_range_filter.py b/common-primitives/tests/test_numeric_range_filter.py deleted file mode 100644 index df340af..0000000 --- a/common-primitives/tests/test_numeric_range_filter.py +++ /dev/null @@ -1,143 +0,0 @@ -import unittest -import os - -from common_primitives import numeric_range_filter -from d3m import container - -import utils as test_utils - - -class NumericRangeFilterPrimitiveTestCase(unittest.TestCase): - def test_inclusive_strict(self): - # load the iris dataset - dataset = test_utils.load_iris_metadata() - resource = test_utils.get_dataframe(dataset) - - filter_hyperparams_class = numeric_range_filter.NumericRangeFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class({ - 'column': 1, - 'min': 6.5, - 'max': 6.7, - 'strict': True, - 'inclusive': True - }) - filter_primitive = numeric_range_filter.NumericRangeFilterPrimitive(hyperparams=hp) - new_dataframe = filter_primitive.produce(inputs=resource).value - - self.assertGreater(new_dataframe['sepalLength'].astype(float).min(), 6.5) - self.assertLess(new_dataframe['sepalLength'].astype(float).max(), 6.7) - - def test_inclusive_permissive(self): - # load the iris dataset - dataset = test_utils.load_iris_metadata() - resource = test_utils.get_dataframe(dataset) - - filter_hyperparams_class = numeric_range_filter.NumericRangeFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class({ - 'column': 1, - 'min': 6.5, - 'max': 6.7, - 'strict': False, - 'inclusive': True - }) - filter_primitive = numeric_range_filter.NumericRangeFilterPrimitive(hyperparams=hp) - new_dataframe = filter_primitive.produce(inputs=resource).value - - self.assertGreaterEqual(new_dataframe['sepalLength'].astype(float).min(), 6.5) - self.assertLessEqual(new_dataframe['sepalLength'].astype(float).max(), 6.7) - - def 
test_exclusive_strict(self): - # load the iris dataset - dataset = test_utils.load_iris_metadata() - resource = test_utils.get_dataframe(dataset) - - filter_hyperparams_class = numeric_range_filter \ - .NumericRangeFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class({ - 'column': 1, - 'min': 6.5, - 'max': 6.7, - 'strict': True, - 'inclusive': False - }) - filter_primitive = numeric_range_filter.NumericRangeFilterPrimitive(hyperparams=hp) - new_dataframe = filter_primitive.produce(inputs=resource).value - - self.assertEqual( - len(new_dataframe.loc[ - (new_dataframe['sepalLength'].astype(float) >= 6.5) & - (new_dataframe['sepalLength'].astype(float) <= 6.7)]), 0) - - def test_exclusive_permissive(self): - # load the iris dataset - dataset = test_utils.load_iris_metadata() - resource = test_utils.get_dataframe(dataset) - - filter_hyperparams_class = numeric_range_filter \ - .NumericRangeFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class({ - 'column': 1, - 'min': 6.5, - 'max': 6.7, - 'strict': False, - 'inclusive': False - }) - filter_primitive = numeric_range_filter.NumericRangeFilterPrimitive(hyperparams=hp) - new_dataframe = filter_primitive.produce(inputs=resource).value - - self.assertEqual( - len(new_dataframe.loc[ - (new_dataframe['sepalLength'].astype(float) > 6.5) & - (new_dataframe['sepalLength'].astype(float) < 6.7)]), 0) - - def test_row_metadata_removal(self): - # load the iris dataset - dataset = test_utils.load_iris_metadata() - - # add metadata for rows 0 and 1 - dataset.metadata = dataset.metadata.update(('learningData', 0), {'a': 0}) - dataset.metadata = dataset.metadata.update(('learningData', 5), {'b': 1}) - - resource = test_utils.get_dataframe(dataset) - - # apply filter that removes rows 0 and 1 - filter_hyperparams_class = numeric_range_filter.NumericRangeFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class({ - 'column': 0, - 'min': 1, - 'max': 4, - 'strict': True, - 'inclusive': False - }) - filter_primitive = numeric_range_filter.NumericRangeFilterPrimitive(hyperparams=hp) - new_df = filter_primitive.produce(inputs=resource).value - - # verify that the length is correct - self.assertEqual(len(new_df), new_df.metadata.query(())['dimension']['length']) - - # verify that the rows were re-indexed in the metadata - self.assertEqual(new_df.metadata.query((0,))['a'], 0) - self.assertEqual(new_df.metadata.query((1,))['b'], 1) - self.assertFalse('b' in new_df.metadata.query((5,))) - - def test_bad_type_handling(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'timeseries_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - resource = test_utils.get_dataframe(dataset) - - filter_hyperparams_class = numeric_range_filter \ - .NumericRangeFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class({ - 'column': 1, - 'min': 6.5, - 'max': 6.7, - 'strict': False, - 'inclusive': False - }) - filter_primitive = numeric_range_filter.NumericRangeFilterPrimitive(hyperparams=hp) - with self.assertRaises(ValueError): - filter_primitive.produce(inputs=resource) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_one_hot_maker.py b/common-primitives/tests/test_one_hot_maker.py deleted file mode 100644 index 245fd70..0000000 --- a/common-primitives/tests/test_one_hot_maker.py +++ /dev/null @@ -1,516 +0,0 @@ -import os -import time 
-import unittest -import numpy as np -import pickle -from d3m import container, exceptions, utils -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe, extract_columns_semantic_types, one_hot_maker, column_parser - - -def _copy_target_as_categorical_feature(attributes, targets): - attributes = targets.append_columns(attributes) - for column_name in targets.columns.values: - column_mask = attributes.columns.get_loc(column_name) - if isinstance(column_mask, int): - column_index = column_mask - else: - column_index = np.where(column_mask)[0][-1].item() - attributes.metadata = attributes.metadata.remove_semantic_type( - (metadata_base.ALL_ELEMENTS, column_index), - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget') - attributes.metadata = attributes.metadata.remove_semantic_type( - (metadata_base.ALL_ELEMENTS, column_index), - 'https://metadata.datadrivendiscovery.org/types/Target') - attributes.metadata = attributes.metadata.remove_semantic_type( - (metadata_base.ALL_ELEMENTS, column_index), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - attributes.metadata = attributes.metadata.add_semantic_type( - (metadata_base.ALL_ELEMENTS, column_index), - 'https://metadata.datadrivendiscovery.org/types/CategoricalData') - attributes.metadata = attributes.metadata.add_semantic_type( - (metadata_base.ALL_ELEMENTS, column_index), - 'https://metadata.datadrivendiscovery.org/types/Attribute') - attributes.metadata = attributes.metadata.update_column(column_index, - {'custom_metadata': metadata_base.NO_VALUE}) - return attributes - - -def _get_iris(): - dataset_doc_path = os.path.abspath( - os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - hyperparams_class = \ - dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - - dataframe = primitive.produce(inputs=dataset).value - - return dataframe - - -def _get_iris_columns(): - dataframe = _get_iris() - - # We set custom metadata on columns. - for column_index in range(1, 5): - dataframe.metadata = dataframe.metadata.update_column(column_index, {'custom_metadata': 'attributes'}) - for column_index in range(5, 6): - dataframe.metadata = dataframe.metadata.update_column(column_index, {'custom_metadata': 'targets'}) - - # We set semantic types like runtime would. - dataframe.metadata = dataframe.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 5), - 'https://metadata.datadrivendiscovery.org/types/Target') - dataframe.metadata = dataframe.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 5), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataframe.metadata = dataframe.metadata.remove_semantic_type((metadata_base.ALL_ELEMENTS, 5), - 'https://metadata.datadrivendiscovery.org/types/Attribute') - - # Parsing. 
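# (ColumnParserPrimitive converts the string columns into their parsed
# structural types based on each column's semantic types -- after this step
# the float attributes really are floats. Note that the
# metadata.query()['primitive_code']['class_type_arguments']['Hyperparams']
# lookup used below is effectively the long form of the
# metadata.get_hyperparams() shortcut used in other tests in this suite.)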
- hyperparams_class = \ - column_parser.ColumnParserPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = column_parser.ColumnParserPrimitive(hyperparams=hyperparams_class.defaults()) - dataframe = primitive.produce(inputs=dataframe).value - - hyperparams_class = \ - extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive.metadata.query()['primitive_code'][ - 'class_type_arguments']['Hyperparams'] - - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive( - hyperparams=hyperparams_class.defaults().replace( - {'semantic_types': ('https://metadata.datadrivendiscovery.org/types/Attribute',)})) - attributes = primitive.produce(inputs=dataframe).value - - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive( - hyperparams=hyperparams_class.defaults().replace( - {'semantic_types': ('https://metadata.datadrivendiscovery.org/types/SuggestedTarget',)})) - targets = primitive.produce(inputs=dataframe).value - - return dataframe, attributes, targets - - -class OneHotTestCase(unittest.TestCase): - attributes: container.DataFrame = None - excp_attributes: container.DataFrame = None - targets: container.DataFrame = None - dataframe: container.DataFrame = None - unseen_species: str = 'Unseen-Species' - missing_value: float = np.NaN - - @classmethod - def setUpClass(cls) -> None: - cls.dataframe, cls.attributes, cls.targets = _get_iris_columns() - cls.attributes = _copy_target_as_categorical_feature(attributes=cls.attributes, targets=cls.targets) - cls.excp_attributes = cls.attributes.copy() - - def tearDown(self): - self.attributes.iloc[:3, 0] = 'Iris-setosa' - self.excp_attributes.iloc[:3, 0] = 'Iris-setosa' - - def test_fit_produce(self): - attributes = _copy_target_as_categorical_feature(self.attributes, - self.targets.rename(columns={'species': '2-species'})) - attributes.metadata = attributes.metadata.update_column(1, { - 'name': '2-species' - }) - - hyperparams_class = \ - one_hot_maker.OneHotMakerPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = one_hot_maker.OneHotMakerPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'replace'})) - - primitive.set_training_data(inputs=attributes) - primitive.fit() - after_onehot = primitive.produce(inputs=attributes).value - # 1 for the original, so we remove it. 
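# (Column-count arithmetic for the assertion below: each of the two
# categorical columns ('species' and '2-species', three categories each)
# expands into three indicator columns while the original column is dropped,
# a net gain of len(unique) - 1 columns per categorical column; together with
# the four numeric attributes that gives 2 * 3 + 4 = 10 columns.)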
- self.assertEqual(after_onehot.shape[1], 2 * (len(self.targets['species'].unique()) - 1) + attributes.shape[1]) - self.assertEqual(after_onehot.shape[0], self.targets.shape[0]) - # 3 unique value for 2 (species, 2-species) 3 * 2 = 6 - self.assertTrue(all(dtype == 'uint8' for dtype in after_onehot.dtypes[:6])) - self.assertEqual(list(after_onehot.columns.values), [ - 'species.Iris-setosa', 'species.Iris-versicolor', 'species.Iris-virginica', - '2-species.Iris-setosa', '2-species.Iris-versicolor', '2-species.Iris-virginica', - 'sepalLength', 'sepalWidth', 'petalLength', 'petalWidth']) - self._test_metadata_return_replace(after_onehot.metadata) - - def test_error_unseen_categories_ignore(self): - # default(ignore) case - self.excp_attributes.iloc[0, 0] = self.unseen_species - self.excp_attributes.iloc[1, 0] = self.unseen_species + '-2' - self.excp_attributes.iloc[2, 0] = np.NaN - hyperparams_class = \ - one_hot_maker.OneHotMakerPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = one_hot_maker.OneHotMakerPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'replace'})) - - primitive.set_training_data(inputs=self.attributes) - primitive.fit() - one_hot_result = primitive.produce(inputs=self.excp_attributes).value - self.assertEqual(one_hot_result.shape[1], len(self.targets['species'].unique()) + self.attributes.shape[1] - 1) - self.assertEqual(one_hot_result.shape[0], self.targets.shape[0]) - self.assertTrue(all(dtype == 'uint8' for dtype in one_hot_result.dtypes[:3])) - - def test_error_unseen_categories_error(self): - # error case - self.excp_attributes.iloc[0, 0] = self.unseen_species - self.excp_attributes.iloc[1, 0] = self.unseen_species + '-2' - self.excp_attributes.iloc[2, 0] = np.NaN - hyperparams_class = \ - one_hot_maker.OneHotMakerPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = one_hot_maker.OneHotMakerPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'replace', 'handle_unseen': 'error'})) - - primitive.set_training_data(inputs=self.attributes) - primitive.fit() - self.assertRaises(exceptions.UnexpectedValueError, primitive.produce, inputs=self.excp_attributes) - - def test_unseen_categories_handle(self): - # handle case - self.excp_attributes.iloc[0, 0] = self.unseen_species - self.excp_attributes.iloc[1, 0] = self.unseen_species + '-2' - self.excp_attributes.iloc[2, 0] = np.NaN - hyperparams_class = \ - one_hot_maker.OneHotMakerPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = one_hot_maker.OneHotMakerPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'replace', 'handle_unseen': 'column'})) - - primitive.set_training_data(inputs=self.attributes) - primitive.fit() - one_hot_result = primitive.produce(inputs=self.excp_attributes).value - self.assertEqual(one_hot_result.shape[1], - len(self.targets['species'].unique()) + self.attributes.shape[1] - 1 + 1) - # unseen cell should be 1 - self.assertEqual(one_hot_result.iloc[0, 3], 1) - self.assertEqual(one_hot_result.shape[0], self.targets.shape[0]) - self.assertTrue(all(dtype == 'uint8' for dtype in one_hot_result.dtypes[:3])) - self.assertEqual(set(one_hot_result.columns.values), {'petalLength', - 'petalWidth', - 'sepalLength', - 'sepalWidth', - 'species.Iris-setosa', - 'species.Iris-versicolor', - 'species.Iris-virginica', - 'species.Unseen'}) - 
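# (With handle_unseen='column', values not seen during fit -- the two unseen
# species strings here -- are routed into the single extra 'species.Unseen'
# indicator column checked above, rather than the default 'ignore' behaviour,
# where no extra column is added.)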
self._test_metadata_unseen_handle_return_replace(one_hot_result.metadata) - - def test_missing_value_ignore(self): - self.excp_attributes.iloc[0, 0] = self.missing_value - self.excp_attributes.iloc[1, 0] = self.missing_value - - # missing present during fit - hyperparams_class = \ - one_hot_maker.OneHotMakerPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = one_hot_maker.OneHotMakerPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'replace'})) - - primitive.set_training_data(inputs=self.excp_attributes) - primitive.fit() - one_hot_result = primitive.produce(inputs=self.excp_attributes).value - self.assertEqual(one_hot_result.shape[1], len(self.targets['species'].unique()) + self.attributes.shape[1] - 1) - self.assertEqual(one_hot_result.shape[0], self.targets.shape[0]) - self.assertTrue(all(dtype == 'uint8' for dtype in one_hot_result.dtypes[:3])) - self.assertEqual(set(one_hot_result.columns.values), { - 'species.Iris-setosa', 'species.Iris-versicolor', 'species.Iris-virginica', - 'sepalLength', 'sepalWidth', 'petalLength', 'petalWidth'}) - - hyperparams_class = \ - one_hot_maker.OneHotMakerPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = one_hot_maker.OneHotMakerPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'replace'})) - - primitive.set_training_data(inputs=self.attributes) - primitive.fit() - one_hot_result = primitive.produce(inputs=self.excp_attributes).value - self.assertEqual(one_hot_result.shape[1], len(self.targets['species'].unique()) + self.attributes.shape[1] - 1) - self.assertEqual(one_hot_result.shape[0], self.targets.shape[0]) - self.assertTrue(all(dtype == 'uint8' for dtype in one_hot_result.dtypes[:3])) - self.assertEqual(set(one_hot_result.columns.values), { - 'species.Iris-setosa', 'species.Iris-versicolor', 'species.Iris-virginica', - 'sepalLength', 'sepalWidth', 'petalLength', 'petalWidth'}) - - def test_missing_value_error(self): - self.excp_attributes.iloc[0, 0] = np.NaN - self.excp_attributes.iloc[1, 0] = None - # error - hyperparams_class = \ - one_hot_maker.OneHotMakerPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = one_hot_maker.OneHotMakerPrimitive( - hyperparams=hyperparams_class.defaults().replace({ - 'return_result': 'replace', - 'handle_missing_value': 'error', - })) - - primitive.set_training_data(inputs=self.excp_attributes) - self.assertRaises(exceptions.MissingValueError, primitive.fit) - - def test_missing_value_column(self): - self.excp_attributes.iloc[0, 0] = np.NaN - self.excp_attributes.iloc[1, 0] = np.NaN - self.excp_attributes.iloc[2, 0] = 'Unseen-Species' - # column - hyperparams_class = \ - one_hot_maker.OneHotMakerPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = one_hot_maker.OneHotMakerPrimitive( - hyperparams=hyperparams_class.defaults().replace({ - 'return_result': 'replace', - 'handle_missing_value': 'column', - })) - - primitive.set_training_data(inputs=self.attributes) - primitive.fit() - one_hot_result = primitive.produce(inputs=self.excp_attributes).value - self.assertEqual(one_hot_result.shape[1], - len(self.targets['species'].unique()) + 1 + self.attributes.shape[1] - 1) - self.assertEqual(one_hot_result.shape[0], self.targets.shape[0]) - self.assertTrue(all(dtype == 'uint8' for dtype in one_hot_result.dtypes[:4])) - self.assertEqual(set(one_hot_result.columns.values), 
{'petalLength', - 'petalWidth', - 'sepalLength', - 'sepalWidth', - 'species.Iris-setosa', - 'species.Iris-versicolor', - 'species.Iris-virginica', - 'species.Missing'}) - - def test_unseen_column_and_missing_value_column(self): - self.excp_attributes.iloc[0, 0] = np.NaN - self.excp_attributes.iloc[1, 0] = np.NaN - self.excp_attributes.iloc[2, 0] = 'Unseen-Species' - # column - hyperparams_class = \ - one_hot_maker.OneHotMakerPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = one_hot_maker.OneHotMakerPrimitive( - hyperparams=hyperparams_class.defaults().replace({ - 'return_result': 'replace', - 'handle_missing_value': 'column', - 'handle_unseen': 'column' - })) - - primitive.set_training_data(inputs=self.attributes) - primitive.fit() - one_hot_result = primitive.produce(inputs=self.excp_attributes).value - self.assertEqual(one_hot_result.shape[1], - len(self.targets['species'].unique()) + 2 + self.attributes.shape[1] - 1) - self.assertEqual(one_hot_result.shape[0], self.targets.shape[0]) - self.assertTrue(all(dtype == 'uint8' for dtype in one_hot_result.dtypes[:4])) - self.assertEqual(set(one_hot_result.columns.values), {'petalLength', - 'petalWidth', - 'sepalLength', - 'sepalWidth', - 'species.Iris-setosa', - 'species.Iris-versicolor', - 'species.Iris-virginica', - 'species.Missing', - 'species.Unseen'}) - - def test_pickle_unpickle(self): - hyperparams_class = \ - one_hot_maker.OneHotMakerPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = one_hot_maker.OneHotMakerPrimitive( - hyperparams=hyperparams_class.defaults().replace({ - 'return_result': 'replace', - 'handle_missing_value': 'column', - 'handle_unseen': 'column' - })) - - primitive.set_training_data(inputs=self.attributes) - primitive.fit() - - before_pickled_prediction = primitive.produce(inputs=self.attributes).value - pickle_object = pickle.dumps(primitive) - primitive = pickle.loads(pickle_object) - after_unpickled_prediction = primitive.produce(inputs=self.attributes).value - self.assertTrue(container.DataFrame.equals(before_pickled_prediction, after_unpickled_prediction)) - - def _test_metadata_unseen_handle_return_replace(self, after_onehot_metadata): - self.assertEqual(utils.to_json_structure(after_onehot_metadata.to_internal_simple_structure()), [{ - 'metadata': { - 'dimension': { - 'length': 150, - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'] - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'structural_type': 'd3m.container.pandas.DataFrame' - }, - 'selector': [] - }, - { - 'metadata': { - 'dimension': { - 'length': 8, - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'] - } - }, - 'selector': ['__ALL_ELEMENTS__'] - }, - { - 'metadata': { - 'custom_metadata': '__NO_VALUE__', - 'name': 'species.Iris-setosa', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.uint8' - }, - 'selector': ['__ALL_ELEMENTS__', 0] - }, - { - 'metadata': { - 'custom_metadata': '__NO_VALUE__', - 'name': 'species.Iris-versicolor', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.uint8' - }, - 'selector': ['__ALL_ELEMENTS__', 1]}, - { - 'metadata': { - 'custom_metadata': '__NO_VALUE__', - 'name': 
'species.Iris-virginica', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.uint8' - }, - 'selector': ['__ALL_ELEMENTS__', 2]}, - { - 'metadata': {'custom_metadata': '__NO_VALUE__', - 'name': 'species.Unseen', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.uint8'}, - 'selector': ['__ALL_ELEMENTS__', 3] - }, - { - 'metadata': { - 'custom_metadata': 'attributes', - 'name': 'sepalLength', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute' - ], - 'structural_type': 'float' - }, - 'selector': ['__ALL_ELEMENTS__', 4] - }, - { - 'metadata': { - 'custom_metadata': 'attributes', - 'name': 'sepalWidth', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute' - ], - 'structural_type': 'float' - }, - 'selector': ['__ALL_ELEMENTS__', 5] - }, - { - 'metadata': { - 'custom_metadata': 'attributes', - 'name': 'petalLength', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute' - ], - 'structural_type': 'float' - }, - 'selector': ['__ALL_ELEMENTS__', 6] - }, - { - 'metadata': { - 'custom_metadata': 'attributes', - 'name': 'petalWidth', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute' - ], - 'structural_type': 'float' - }, - 'selector': ['__ALL_ELEMENTS__', 7] - } - ]) - - def _test_metadata_return_replace(self, after_onehot_metadata): - self.assertEqual( - utils.to_json_structure(after_onehot_metadata.to_internal_simple_structure()), - [{'metadata': {'dimension': {'length': 150, - 'name': 'rows', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/TabularRow']}, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'structural_type': 'd3m.container.pandas.DataFrame'}, - 'selector': []}, - {'metadata': {'dimension': {'length': 10, - 'name': 'columns', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/TabularColumn']}}, - 'selector': ['__ALL_ELEMENTS__']}, - {'metadata': {'custom_metadata': '__NO_VALUE__', - 'name': 'species.Iris-setosa', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.uint8'}, - 'selector': ['__ALL_ELEMENTS__', 0]}, - {'metadata': {'custom_metadata': '__NO_VALUE__', - 'name': 'species.Iris-versicolor', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.uint8'}, - 'selector': ['__ALL_ELEMENTS__', 1]}, - {'metadata': {'custom_metadata': '__NO_VALUE__', - 'name': 'species.Iris-virginica', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.uint8'}, - 'selector': ['__ALL_ELEMENTS__', 2]}, - {'metadata': {'custom_metadata': '__NO_VALUE__', - 'name': '2-species.Iris-setosa', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.uint8'}, - 'selector': ['__ALL_ELEMENTS__', 3]}, - {'metadata': {'custom_metadata': '__NO_VALUE__', - 'name': '2-species.Iris-versicolor', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.uint8'}, - 'selector': ['__ALL_ELEMENTS__', 4]}, - {'metadata': {'custom_metadata': '__NO_VALUE__', - 
'name': '2-species.Iris-virginica', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.uint8'}, - 'selector': ['__ALL_ELEMENTS__', 5]}, - {'metadata': {'custom_metadata': 'attributes', - 'name': 'sepalLength', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'float'}, - 'selector': ['__ALL_ELEMENTS__', 6]}, - {'metadata': {'custom_metadata': 'attributes', - 'name': 'sepalWidth', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'float'}, - 'selector': ['__ALL_ELEMENTS__', 7]}, - {'metadata': {'custom_metadata': 'attributes', - 'name': 'petalLength', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'float'}, - 'selector': ['__ALL_ELEMENTS__', 8]}, - {'metadata': {'custom_metadata': 'attributes', - 'name': 'petalWidth', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'float'}, - 'selector': ['__ALL_ELEMENTS__', 9]}] - ) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_pandas_onehot_encoder.py b/common-primitives/tests/test_pandas_onehot_encoder.py deleted file mode 100644 index d7b4b30..0000000 --- a/common-primitives/tests/test_pandas_onehot_encoder.py +++ /dev/null @@ -1,178 +0,0 @@ -import unittest -import pandas as pd - -from d3m import container, utils -from common_primitives.pandas_onehot_encoder import PandasOneHotEncoderPrimitive -from d3m.metadata import base as metadata_base - -import utils as test_utils - - -class PandasOneHotEncoderPrimitiveTestCase(unittest.TestCase): - def test_basic(self): - training = pd.DataFrame({'Name': ['Henry', 'Diane', 'Kitty', 'Peter']}) - training = container.DataFrame(training, generate_metadata=True) - training.metadata = training.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 0), 'https://metadata.datadrivendiscovery.org/types/CategoricalData',) - training.metadata = training.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 0), 'https://metadata.datadrivendiscovery.org/types/Attribute',) - - testing = pd.DataFrame({'Name': ['John', 'Alex','Henry','Diane']}) - testing = container.DataFrame(testing, generate_metadata=True) - testing.metadata = testing.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 0), 'https://metadata.datadrivendiscovery.org/types/CategoricalData') - testing.metadata = testing.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 0), 'https://metadata.datadrivendiscovery.org/types/Attribute',) - testing.metadata = testing.metadata.update_column(0, { - 'custom_metadata': 42, - }) - - Hyperparams = PandasOneHotEncoderPrimitive.metadata.get_hyperparams() - ht = PandasOneHotEncoderPrimitive(hyperparams=Hyperparams.defaults()) - - ht.set_training_data(inputs=training) - ht.fit() - - result_df = ht.produce(inputs=testing).value - - self.assertEqual(list(result_df.columns), ['Name_Diane', 'Name_Henry', 'Name_Kitty', 'Name_Peter']) - - self.assertEqual(list(result_df['Name_Henry']), [0, 0, 1, 0]) - self.assertEqual(list(result_df['Name_Diane']), [0, 0, 0, 1]) - self.assertEqual(list(result_df['Name_Kitty']), [0, 0, 0, 0]) - self.assertEqual(list(result_df['Name_Peter']), [0, 0, 0, 0]) - - 
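# (The encoder keeps the category set learned from the training frame --
# Diane, Henry, Kitty, Peter; presumably pandas.get_dummies under the hood,
# which orders the dummies alphabetically -- so the unseen test values John
# and Alex simply produce all-zero indicator rows, and the
# 'custom_metadata': 42 annotation set on the source column is propagated to
# every generated indicator column, as the metadata assertions below confirm.)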
self.assertEqual(test_utils.convert_metadata(utils.to_json_structure(result_df.metadata.to_internal_simple_structure())), [{ - 'selector': [], - 'metadata': { - 'dimension': { - 'length': 4, - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'structural_type': 'd3m.container.pandas.DataFrame', - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'length': 4, - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'custom_metadata': 42, - 'name': 'Name_Diane', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.uint8', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'custom_metadata': 42, - 'name': 'Name_Henry', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.uint8', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 2], - 'metadata': { - 'custom_metadata': 42, - 'name': 'Name_Kitty', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.uint8', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 3], - 'metadata': { - 'custom_metadata': 42, - 'name': 'Name_Peter', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.uint8', - }, - }]) - - ht = PandasOneHotEncoderPrimitive(hyperparams=Hyperparams.defaults().replace({ - 'dummy_na': True, - })) - - ht.set_training_data(inputs=training) - ht.fit() - - result_df = ht.produce(inputs=testing).value - - self.assertEqual(list(result_df.columns), ['Name_Diane', 'Name_Henry', 'Name_Kitty', 'Name_Peter', 'Name_nan']) - - self.assertEqual(list(result_df['Name_Henry']), [0, 0, 1, 0]) - self.assertEqual(list(result_df['Name_Diane']), [0, 0, 0, 1]) - self.assertEqual(list(result_df['Name_Kitty']), [0, 0, 0, 0]) - self.assertEqual(list(result_df['Name_Peter']), [0, 0, 0, 0]) - self.assertEqual(list(result_df['Name_nan']), [1, 1, 0, 0]) - - self.assertEqual(test_utils.convert_metadata(utils.to_json_structure(result_df.metadata.to_internal_simple_structure())), [{ - 'selector': [], - 'metadata': { - 'dimension': { - 'length': 4, - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'structural_type': 'd3m.container.pandas.DataFrame', - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'length': 5, - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'custom_metadata': 42, - 'name': 'Name_Diane', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.uint8', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'custom_metadata': 42, - 'name': 'Name_Henry', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.uint8', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 
2], - 'metadata': { - 'custom_metadata': 42, - 'name': 'Name_Kitty', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.uint8', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 3], - 'metadata': { - 'custom_metadata': 42, - 'name': 'Name_Peter', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.uint8', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 4], - 'metadata': { - 'custom_metadata': 42, - 'name': 'Name_nan', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.uint8', - }, - }]) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_random_forest.py b/common-primitives/tests/test_random_forest.py deleted file mode 100644 index 5daee9c..0000000 --- a/common-primitives/tests/test_random_forest.py +++ /dev/null @@ -1,701 +0,0 @@ -import logging -import os -import pickle -import unittest - -from d3m import container, utils -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe, extract_columns_semantic_types, random_forest, column_parser - - -class RandomForestTestCase(unittest.TestCase): - def _get_iris(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - - dataframe = primitive.produce(inputs=dataset).value - - return dataframe - - def _get_iris_columns(self): - dataframe = self._get_iris() - - # We set custom metadata on columns. - for column_index in range(1, 5): - dataframe.metadata = dataframe.metadata.update_column(column_index, {'custom_metadata': 'attributes'}) - for column_index in range(5, 6): - dataframe.metadata = dataframe.metadata.update_column(column_index, {'custom_metadata': 'targets'}) - - # We set semantic types like runtime would. - dataframe.metadata = dataframe.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Target') - dataframe.metadata = dataframe.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataframe.metadata = dataframe.metadata.remove_semantic_type((metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - # Parsing. 
- hyperparams_class = column_parser.ColumnParserPrimitive.metadata.get_hyperparams() - primitive = column_parser.ColumnParserPrimitive(hyperparams=hyperparams_class.defaults()) - dataframe = primitive.produce(inputs=dataframe).value - - hyperparams_class = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive.metadata.get_hyperparams() - - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive(hyperparams=hyperparams_class.defaults().replace({'semantic_types': ('https://metadata.datadrivendiscovery.org/types/Attribute',)})) - attributes = primitive.produce(inputs=dataframe).value - - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive(hyperparams=hyperparams_class.defaults().replace({'semantic_types': ('https://metadata.datadrivendiscovery.org/types/SuggestedTarget',)})) - targets = primitive.produce(inputs=dataframe).value - - return dataframe, attributes, targets - - def test_single_target(self): - dataframe, attributes, targets = self._get_iris_columns() - - self.assertEqual(list(targets.columns), ['species']) - - hyperparams_class = random_forest.RandomForestClassifierPrimitive.metadata.get_hyperparams() - primitive = random_forest.RandomForestClassifierPrimitive(hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.fit() - - predictions = primitive.produce(inputs=attributes).value - - self.assertEqual(list(predictions.columns), ['species']) - - self.assertEqual(predictions.shape, (150, 1)) - self.assertEqual(predictions.iloc[0, 0], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(0)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(0)['custom_metadata'], 'targets') - - self._test_single_target_metadata(predictions.metadata) - - samples = primitive.sample(inputs=attributes).value - - self.assertEqual(list(samples[0].columns), ['species']) - - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, (150, 1)) - self.assertEqual(samples[0].iloc[0, 0], 'Iris-setosa') - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(0)['name'], 'species') - self.assertEqual(samples[0].metadata.query_column(0)['custom_metadata'], 'targets') - - log_likelihoods = primitive.log_likelihoods(inputs=attributes, outputs=targets).value - - self.assertEqual(list(log_likelihoods.columns), ['species']) - - self.assertEqual(log_likelihoods.shape, (150, 1)) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - - log_likelihood = primitive.log_likelihood(inputs=attributes, outputs=targets).value - - self.assertEqual(list(log_likelihood.columns), ['species']) - - self.assertEqual(log_likelihood.shape, (1, 1)) - self.assertAlmostEqual(log_likelihood.iloc[0, 0], -3.72702785304761) - 
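# (The shapes asserted above make the relationship explicit: log_likelihoods
# returns one value per row and target column, (150, 1) here, while
# log_likelihood reduces them to a single value per target column, (1, 1);
# both carry the target column name in their metadata.)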
self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - - feature_importances = primitive.produce_feature_importances().value - - self.assertEqual(list(feature_importances), ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth']) - self.assertEqual(feature_importances.metadata.query_column(0)['name'], 'sepalLength') - self.assertEqual(feature_importances.metadata.query_column(1)['name'], 'sepalWidth') - self.assertEqual(feature_importances.metadata.query_column(2)['name'], 'petalLength') - self.assertEqual(feature_importances.metadata.query_column(3)['name'], 'petalWidth') - - self.assertEqual(feature_importances.values.tolist(), [[0.09090795402103087, - 0.024531041234715757, - 0.46044473961715215, - 0.42411626512710127, - ]]) - - def _test_single_target_metadata(self, predictions_metadata): - expected_metadata = [{ - 'selector': [], - 'metadata': { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 1, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'structural_type': 'str', - 'name': 'species', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', 'https://metadata.datadrivendiscovery.org/types/Target', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'custom_metadata': 'targets', - }, - }] - - self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), expected_metadata) - - def test_multiple_targets(self): - dataframe, attributes, targets = self._get_iris_columns() - - targets = targets.append_columns(targets) - - self.assertEqual(list(targets.columns), ['species', 'species']) - - hyperparams_class = random_forest.RandomForestClassifierPrimitive.metadata.get_hyperparams() - primitive = random_forest.RandomForestClassifierPrimitive(hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.fit() - - predictions = primitive.produce(inputs=attributes).value - - self.assertEqual(list(predictions.columns), ['species', 'species']) - - self.assertEqual(predictions.shape, (150, 2)) - for column_index in range(2): - self.assertEqual(predictions.iloc[0, column_index], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, column_index), 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, column_index), 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(column_index)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(column_index)['custom_metadata'], 'targets') - - samples = primitive.sample(inputs=attributes).value - - self.assertEqual(list(samples[0].columns), ['species', 'species']) - - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, (150, 2)) - for 
column_index in range(2): - self.assertEqual(samples[0].iloc[0, column_index], 'Iris-setosa') - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, column_index), 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, column_index), 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(column_index)['name'], 'species') - self.assertEqual(samples[0].metadata.query_column(column_index)['custom_metadata'], 'targets') - - log_likelihoods = primitive.log_likelihoods(inputs=attributes, outputs=targets).value - - self.assertEqual(list(log_likelihoods.columns), ['species', 'species']) - - self.assertEqual(log_likelihoods.shape, (150, 2)) - for column_index in range(2): - self.assertEqual(log_likelihoods.metadata.query_column(column_index)['name'], 'species') - - log_likelihood = primitive.log_likelihood(inputs=attributes, outputs=targets).value - - self.assertEqual(list(log_likelihood.columns), ['species', 'species']) - - self.assertEqual(log_likelihood.shape, (1, 2)) - for column_index in range(2): - self.assertAlmostEqual(log_likelihood.iloc[0, column_index], -3.72702785304761) - self.assertEqual(log_likelihoods.metadata.query_column(column_index)['name'], 'species') - - feature_importances = primitive.produce_feature_importances().value - - self.assertEqual(list(feature_importances), ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth']) - self.assertEqual(feature_importances.metadata.query_column(0)['name'], 'sepalLength') - self.assertEqual(feature_importances.metadata.query_column(1)['name'], 'sepalWidth') - self.assertEqual(feature_importances.metadata.query_column(2)['name'], 'petalLength') - self.assertEqual(feature_importances.metadata.query_column(3)['name'], 'petalWidth') - - self.assertEqual(feature_importances.values.tolist(), [[0.09090795402103087, - 0.024531041234715757, - 0.46044473961715215, - 0.42411626512710127, - ]]) - - def test_semantic_types(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = random_forest.RandomForestClassifierPrimitive.metadata.get_hyperparams() - primitive = random_forest.RandomForestClassifierPrimitive(hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=dataframe, outputs=dataframe) - primitive.fit() - - predictions = primitive.produce(inputs=dataframe).value - - self.assertEqual(list(predictions.columns), ['species']) - - self.assertEqual(predictions.shape, (150, 1)) - self.assertEqual(predictions.iloc[0, 0], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(0)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(0)['custom_metadata'], 'targets') - - samples = primitive.sample(inputs=dataframe).value - - self.assertEqual(list(samples[0].columns), ['species']) - - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, (150, 1)) - self.assertEqual(samples[0].iloc[0, 0], 'Iris-setosa') - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), 
'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(0)['name'], 'species') - self.assertEqual(samples[0].metadata.query_column(0)['custom_metadata'], 'targets') - - log_likelihoods = primitive.log_likelihoods(inputs=dataframe, outputs=dataframe).value - - self.assertEqual(list(log_likelihoods.columns), ['species']) - - self.assertEqual(log_likelihoods.shape, (150, 1)) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - - log_likelihood = primitive.log_likelihood(inputs=dataframe, outputs=dataframe).value - - self.assertEqual(list(log_likelihood.columns), ['species']) - - self.assertEqual(log_likelihood.shape, (1, 1)) - self.assertAlmostEqual(log_likelihood.iloc[0, 0], -3.72702785304761) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - - feature_importances = primitive.produce_feature_importances().value - - self.assertEqual(list(feature_importances), ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth']) - self.assertEqual(feature_importances.metadata.query_column(0)['name'], 'sepalLength') - self.assertEqual(feature_importances.metadata.query_column(1)['name'], 'sepalWidth') - self.assertEqual(feature_importances.metadata.query_column(2)['name'], 'petalLength') - self.assertEqual(feature_importances.metadata.query_column(3)['name'], 'petalWidth') - - self.assertEqual(feature_importances.values.tolist(), [[0.09090795402103087, - 0.024531041234715757, - 0.46044473961715215, - 0.42411626512710127, - ]]) - - def test_return_append(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = random_forest.RandomForestClassifierPrimitive.metadata.get_hyperparams() - primitive = random_forest.RandomForestClassifierPrimitive(hyperparams=hyperparams_class.defaults()) - - primitive.set_training_data(inputs=dataframe, outputs=dataframe) - primitive.fit() - - predictions = primitive.produce(inputs=dataframe).value - - self.assertEqual(list(predictions.columns), [ - 'd3mIndex', - 'sepalLength', - 'sepalWidth', - 'petalLength', - 'petalWidth', - 'species', - 'species', - ]) - - self.assertEqual(predictions.shape, (150, 7)) - self.assertEqual(predictions.iloc[0, 6], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 6), 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 6), 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(6)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(6)['custom_metadata'], 'targets') - - self._test_return_append_metadata(predictions.metadata) - - def _test_return_append_metadata(self, predictions_metadata): - self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), [{ - 'selector': [], - 'metadata': { - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - }, - }, { - 'selector': 
['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 7, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'd3mIndex', - 'structural_type': 'int', - 'semantic_types': ['http://schema.org/Integer', 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'name': 'sepalLength', - 'structural_type': 'float', - 'semantic_types': ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'custom_metadata': 'attributes', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 2], - 'metadata': { - 'name': 'sepalWidth', - 'structural_type': 'float', - 'semantic_types': ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'custom_metadata': 'attributes', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 3], - 'metadata': { - 'name': 'petalLength', - 'structural_type': 'float', - 'semantic_types': ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'custom_metadata': 'attributes', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 4], - 'metadata': { - 'name': 'petalWidth', - 'structural_type': 'float', - 'semantic_types': ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'custom_metadata': 'attributes', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 5], - 'metadata': { - 'name': 'species', - 'structural_type': 'str', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', 'https://metadata.datadrivendiscovery.org/types/Target', 'https://metadata.datadrivendiscovery.org/types/TrueTarget'], - 'custom_metadata': 'targets', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 6], - 'metadata': { - 'structural_type': 'str', - 'name': 'species', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', 'https://metadata.datadrivendiscovery.org/types/Target', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'custom_metadata': 'targets', - }, - }]) - - def test_return_new(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = random_forest.RandomForestClassifierPrimitive.metadata.get_hyperparams() - primitive = random_forest.RandomForestClassifierPrimitive(hyperparams=hyperparams_class.defaults().replace({'return_result': 'new'})) - - primitive.set_training_data(inputs=dataframe, outputs=dataframe) - primitive.fit() - - predictions = primitive.produce(inputs=dataframe).value - - self.assertEqual(list(predictions.columns), [ - 'd3mIndex', - 'species', - ]) - - self.assertEqual(predictions.shape, (150, 2)) - self.assertEqual(predictions.iloc[0, 1], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(1)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(1)['custom_metadata'], 'targets') - - self._test_return_new_metadata(predictions.metadata) - - def 
_test_return_new_metadata(self, predictions_metadata): - expected_metadata = [{ - 'selector': [], - 'metadata': { - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 2, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'd3mIndex', - 'structural_type': 'int', - 'semantic_types': ['http://schema.org/Integer', 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'structural_type': 'str', - 'name': 'species', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', 'https://metadata.datadrivendiscovery.org/types/Target', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'custom_metadata': 'targets', - }, - }] - - self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), expected_metadata) - - def test_return_replace(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = random_forest.RandomForestClassifierPrimitive.metadata.get_hyperparams() - primitive = random_forest.RandomForestClassifierPrimitive(hyperparams=hyperparams_class.defaults().replace({'return_result': 'replace'})) - - primitive.set_training_data(inputs=dataframe, outputs=dataframe) - primitive.fit() - - predictions = primitive.produce(inputs=dataframe).value - - self.assertEqual(list(predictions.columns), [ - 'd3mIndex', - 'species', - 'species', - ]) - - self.assertEqual(predictions.shape, (150, 3)) - self.assertEqual(predictions.iloc[0, 1], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(1)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(1)['custom_metadata'], 'targets') - - self._test_return_replace_metadata(predictions.metadata) - - def test_get_set_params(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = random_forest.RandomForestClassifierPrimitive.metadata.get_hyperparams() - primitive = random_forest.RandomForestClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.fit() - - before_set_prediction = primitive.produce(inputs=attributes).value - params = primitive.get_params() - primitive.set_params(params=params) - after_set_prediction = primitive.produce(inputs=attributes).value - self.assertTrue(container.DataFrame.equals(before_set_prediction, after_set_prediction)) - - def test_pickle_unpickle(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = 
random_forest.RandomForestClassifierPrimitive.metadata.get_hyperparams() - primitive = random_forest.RandomForestClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.fit() - - before_pickled_prediction = primitive.produce(inputs=attributes).value - pickle_object = pickle.dumps(primitive) - primitive = pickle.loads(pickle_object) - after_unpickled_prediction = primitive.produce(inputs=attributes).value - self.assertTrue(container.DataFrame.equals(before_pickled_prediction, after_unpickled_prediction)) - - def _test_return_replace_metadata(self, predictions_metadata): - self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), [{ - 'selector': [], - 'metadata': { - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 3, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'd3mIndex', - 'structural_type': 'int', - 'semantic_types': ['http://schema.org/Integer', 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'structural_type': 'str', - 'name': 'species', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', 'https://metadata.datadrivendiscovery.org/types/Target', 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'custom_metadata': 'targets', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 2], - 'metadata': { - 'name': 'species', - 'structural_type': 'str', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', 'https://metadata.datadrivendiscovery.org/types/Target', 'https://metadata.datadrivendiscovery.org/types/TrueTarget'], - 'custom_metadata': 'targets', - }, - }]) - - def test_empty_data(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = random_forest.RandomForestClassifierPrimitive.metadata.get_hyperparams() - primitive = random_forest.RandomForestClassifierPrimitive(hyperparams=hyperparams_class.defaults()) - - just_index_dataframe = dataframe.select_columns([0]) - no_attributes_dataframe = dataframe.select_columns([0, 5]) - - primitive.set_training_data(inputs=just_index_dataframe, outputs=just_index_dataframe) - - with self.assertRaises(Exception): - primitive.fit() - - primitive.set_training_data(inputs=no_attributes_dataframe, outputs=no_attributes_dataframe) - - with self.assertRaises(Exception): - primitive.fit() - - primitive = random_forest.RandomForestClassifierPrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'error_on_no_columns': False, - 'return_result': 'replace', - })) - - primitive.set_training_data(inputs=just_index_dataframe, outputs=just_index_dataframe) - - with self.assertLogs(primitive.logger, 
level=logging.WARNING) as cm: - primitive.fit() - - self.assertEqual(len(cm.records), 2) - self.assertEqual(cm.records[0].msg, "No inputs columns.") - self.assertEqual(cm.records[1].msg, "No outputs columns.") - - # Test pickling. - pickle_object = pickle.dumps(primitive) - pickle.loads(pickle_object) - - with self.assertLogs(primitive.logger, level=logging.WARNING) as cm: - predictions = primitive.produce(inputs=just_index_dataframe).value - - self.assertEqual(len(cm.records), 1) - self.assertEqual(cm.records[0].msg, "No inputs columns.") - - self.assertEqual(list(predictions.columns), [ - 'd3mIndex', - ]) - self.assertEqual(predictions.shape, (150, 1)) - - self.assertEqual(predictions.metadata.to_internal_json_structure(), just_index_dataframe.metadata.to_internal_json_structure()) - - primitive = random_forest.RandomForestClassifierPrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'error_on_no_columns': False, - 'return_result': 'replace', - })) - - primitive.set_training_data(inputs=no_attributes_dataframe, outputs=no_attributes_dataframe) - - with self.assertLogs(primitive.logger, level=logging.WARNING) as cm: - primitive.fit() - - self.assertEqual(len(cm.records), 1) - self.assertEqual(cm.records[0].msg, "No inputs columns.") - - # Test pickling. - pickle_object = pickle.dumps(primitive) - pickle.loads(pickle_object) - - with self.assertLogs(primitive.logger, level=logging.WARNING) as cm: - predictions = primitive.produce(inputs=no_attributes_dataframe).value - - self.assertEqual(len(cm.records), 1) - self.assertEqual(cm.records[0].msg, "No inputs columns.") - - self.assertEqual(list(predictions.columns), [ - 'd3mIndex', - 'species', - ]) - self.assertEqual(predictions.shape, (150, 2)) - - self.assertEqual(predictions.metadata.to_internal_json_structure(), no_attributes_dataframe.metadata.to_internal_json_structure()) - - primitive = random_forest.RandomForestClassifierPrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'error_on_no_columns': False, - 'return_result': 'new', - })) - - primitive.set_training_data(inputs=no_attributes_dataframe, outputs=no_attributes_dataframe) - - with self.assertLogs(primitive.logger, level=logging.WARNING) as cm: - primitive.fit() - - self.assertEqual(len(cm.records), 1) - self.assertEqual(cm.records[0].msg, "No inputs columns.") - - # Test pickling. - pickle_object = pickle.dumps(primitive) - pickle.loads(pickle_object) - - with self.assertLogs(primitive.logger, level=logging.WARNING) as cm: - with self.assertRaises(ValueError): - primitive.produce(inputs=no_attributes_dataframe) - - self.assertEqual(len(cm.records), 1) - self.assertEqual(cm.records[0].msg, "No inputs columns.") - - primitive = random_forest.RandomForestClassifierPrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'error_on_no_columns': False, - 'return_result': 'append', - })) - - primitive.set_training_data(inputs=no_attributes_dataframe, outputs=no_attributes_dataframe) - - with self.assertLogs(primitive.logger, level=logging.WARNING) as cm: - primitive.fit() - - # Test pickling. 
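- # (a primitive fitted on no usable columns must still survive a pickle round trip)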
- pickle_object = pickle.dumps(primitive) - pickle.loads(pickle_object) - - self.assertEqual(len(cm.records), 1) - self.assertEqual(cm.records[0].msg, "No inputs columns.") - - with self.assertLogs(primitive.logger, level=logging.WARNING) as cm: - predictions = primitive.produce(inputs=no_attributes_dataframe).value - - self.assertEqual(len(cm.records), 1) - self.assertEqual(cm.records[0].msg, "No inputs columns.") - - self.assertEqual(list(predictions.columns), [ - 'd3mIndex', - 'species', - ]) - self.assertEqual(predictions.shape, (150, 2)) - - self.assertEqual(predictions.metadata.to_internal_json_structure(), no_attributes_dataframe.metadata.to_internal_json_structure()) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_ravel.py b/common-primitives/tests/test_ravel.py deleted file mode 100644 index 33d11ac..0000000 --- a/common-primitives/tests/test_ravel.py +++ /dev/null @@ -1,125 +0,0 @@ -import unittest - -from d3m import container, utils - -from common_primitives import ravel - - -class RavelAsRowPrimitiveTestCase(unittest.TestCase): - def _get_data(self): - data = container.DataFrame({ - 'a': [1, 2, 3], - 'b': [container.ndarray([2, 3, 4]), container.ndarray([5, 6, 7]), container.ndarray([8, 9, 10])] - }, { - 'top_level': 'foobar1', - }, generate_metadata=True) - - data.metadata = data.metadata.update_column(1, { - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - }) - - return data - - def test_basic(self): - dataframe = container.DataFrame({ - 'a': [1, 2, 3], - 'b': ['a', 'b', 'c'] - }, { - 'top_level': 'foobar1', - }, generate_metadata=True) - - self.assertEqual(dataframe.shape, (3, 2)) - - for row_index in range(len(dataframe)): - for column_index in range(len(dataframe.columns)): - dataframe.metadata = dataframe.metadata.update((row_index, column_index), { - 'location': (row_index, column_index), - }) - - dataframe.metadata.check(dataframe) - - hyperparams = ravel.RavelAsRowPrimitive.metadata.get_hyperparams() - primitive = ravel.RavelAsRowPrimitive(hyperparams=hyperparams.defaults()) - dataframe = primitive.produce(inputs=dataframe).value - - self.assertEqual(dataframe.shape, (1, 6)) - - self.assertEqual(dataframe.values.tolist(), [[1, 'a', 2, 'b', 3, 'c']]) - self.assertEqual(list(dataframe.columns), ['a', 'b', 'a', 'b', 'a', 'b']) - - self.assertEqual(utils.to_json_structure(dataframe.metadata.to_internal_simple_structure()), [{ - 'selector': [], - 'metadata': { - 'dimension': { - 'length': 1, - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'structural_type': 'd3m.container.pandas.DataFrame', - 'top_level': 'foobar1', - }, - }, - { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'length': 6, - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - }, - }, - }, - { - 'selector': [0, 0], - 'metadata': { - 'location': [0, 0], - 'name': 'a', - 'structural_type': 'numpy.int64', - }, - }, - { - 'selector': [0, 1], - 'metadata': { - 'location': [0, 1], - 'name': 'b', - 'structural_type': 'str', - }, - }, - { - 'selector': [0, 2], - 'metadata': { - 'location': [1, 0], - 'name': 'a', - 'structural_type': 'numpy.int64', - }, - }, - { - 'selector': [0, 3], - 'metadata': { - 'location': [1, 1], - 'name': 'b', - 
'structural_type': 'str', - }, - }, - { - 'selector': [0, 4], - 'metadata': { - 'location': [2, 0], - 'name': 'a', - 'structural_type': 'numpy.int64', - }, - }, - { - 'selector': [0, 5], - 'metadata': { - 'location': [2, 1], - 'name': 'b', - 'structural_type': 'str', - }, - }]) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_redact_columns.py b/common-primitives/tests/test_redact_columns.py deleted file mode 100644 index 5bd5df0..0000000 --- a/common-primitives/tests/test_redact_columns.py +++ /dev/null @@ -1,173 +0,0 @@ -import os -import unittest - -from d3m import container, utils -from d3m.metadata import base as metadata_base - -from common_primitives import redact_columns - - -class RedactColumnsPrimitiveTestCase(unittest.TestCase): - def _get_datasets(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - datasets = container.List([dataset], { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': container.List, - 'dimension': { - 'length': 1, - }, - }, generate_metadata=False) - - # We update metadata based on metadata of each dataset. - # TODO: In the future this might be done automatically by generate_metadata. 
- # See: https://gitlab.com/datadrivendiscovery/d3m/issues/119 - for index, dataset in enumerate(datasets): - datasets.metadata = dataset.metadata.copy_to(datasets.metadata, (), (index,)) - - return dataset_doc_path, datasets - - def test_basic(self): - dataset_doc_path, datasets = self._get_datasets() - - hyperparams_class = redact_columns.RedactColumnsPrimitive.metadata.get_hyperparams() - - primitive = redact_columns.RedactColumnsPrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'semantic_types': ('https://metadata.datadrivendiscovery.org/types/TrueTarget',), - 'add_semantic_types': ('https://metadata.datadrivendiscovery.org/types/RedactedTarget', 'https://metadata.datadrivendiscovery.org/types/MissingData'), - })) - redacted_datasets = primitive.produce(inputs=datasets).value - - self.assertEqual(len(redacted_datasets), 1) - - redacted_dataset = redacted_datasets[0] - - self.assertIsInstance(redacted_dataset, container.Dataset) - self.assertEqual(redacted_dataset['learningData']['species'].values.tolist(), [''] * 150) - - self._test_metadata(redacted_datasets.metadata, dataset_doc_path, True) - self._test_metadata(redacted_dataset.metadata, dataset_doc_path, False) - - def _test_metadata(self, metadata, dataset_doc_path, is_list): - top_metadata = { - 'structural_type': 'd3m.container.dataset.Dataset', - 'id': 'iris_dataset_1', - 'version': '4.0.0', - 'name': 'Iris Dataset', - 'location_uris': [ - 'file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path), - ], - 'dimension': { - 'name': 'resources', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/DatasetResource'], - 'length': 1, - }, - 'digest': '49404bf166238fbdac2b6d6baa899a0d1bf8ed5976525fa7353fd732ac218a85', - 'source': { - 'license': 'CC', - 'redacted': False, - 'human_subjects_research': False, - }, - } - - if is_list: - prefix = [0] - list_metadata = [{ - 'selector': [], - 'metadata': { - 'dimension': { - 'length': 1, - }, - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.list.List', - }, - }] - else: - prefix = [] - list_metadata = [] - top_metadata['schema'] = metadata_base.CONTAINER_SCHEMA_VERSION - - self.assertEqual(utils.to_json_structure(metadata.to_internal_simple_structure()), list_metadata + [{ - 'selector': prefix + [], - 'metadata': top_metadata, - }, { - 'selector': prefix + ['learningData'], - 'metadata': { - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table', 'https://metadata.datadrivendiscovery.org/types/DatasetEntryPoint'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - }, - }, { - 'selector': prefix + ['learningData', '__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 6, - }, - }, - }, { - 'selector': prefix + ['learningData', '__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'd3mIndex', - 'structural_type': 'str', - 'semantic_types': ['http://schema.org/Integer', 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - }, - }, { - 'selector': prefix + ['learningData', '__ALL_ELEMENTS__', 1], - 'metadata': { - 'name': 'sepalLength', - 'structural_type': 'str', - 'semantic_types': ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - }, - }, { - 'selector': prefix + ['learningData', 
'__ALL_ELEMENTS__', 2], - 'metadata': { - 'name': 'sepalWidth', - 'structural_type': 'str', - 'semantic_types': ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - }, - }, { - 'selector': prefix + ['learningData', '__ALL_ELEMENTS__', 3], - 'metadata': { - 'name': 'petalLength', - 'structural_type': 'str', - 'semantic_types': ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - }, - }, { - 'selector': prefix + ['learningData', '__ALL_ELEMENTS__', 4], - 'metadata': { - 'name': 'petalWidth', - 'structural_type': 'str', - 'semantic_types': ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - }, - }, { - 'selector': prefix + ['learningData', '__ALL_ELEMENTS__', 5], - 'metadata': { - 'name': 'species', - 'structural_type': 'str', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/TrueTarget', - 'https://metadata.datadrivendiscovery.org/types/RedactedTarget', - 'https://metadata.datadrivendiscovery.org/types/MissingData', - ], - }, - }]) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_regex_filter.py b/common-primitives/tests/test_regex_filter.py deleted file mode 100644 index 42e0d71..0000000 --- a/common-primitives/tests/test_regex_filter.py +++ /dev/null @@ -1,114 +0,0 @@ -import unittest -import os - -from common_primitives import regex_filter -from d3m import container, exceptions - -import utils as test_utils - - -class RegexFilterPrimitiveTestCase(unittest.TestCase): - def test_inclusive(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - resource = test_utils.get_dataframe(dataset) - - filter_hyperparams_class = regex_filter.RegexFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class({ - 'column': 1, - 'inclusive': True, - 'regex': 'AAA' - }) - - filter_primitive = regex_filter.RegexFilterPrimitive(hyperparams=hp) - new_df = filter_primitive.produce(inputs=resource).value - - matches = new_df[new_df['code'].str.match('AAA')] - self.assertTrue(matches['code'].unique() == ['AAA']) - - def test_exclusive(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - resource = test_utils.get_dataframe(dataset) - - filter_hyperparams_class = regex_filter.RegexFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class({ - 'column': 1, - 'inclusive': False, - 'regex': 'AAA' - }) - - filter_primitive = regex_filter.RegexFilterPrimitive(hyperparams=hp) - new_df = filter_primitive.produce(inputs=resource).value - - matches = new_df[~new_df['code'].str.match('AAA')] - self.assertTrue(set(matches['code'].unique()) == set(['BBB', 'CCC'])) - - def test_numeric(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - 
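# ("test_utils.get_dataframe" is the helper from the local tests/utils.py module; it is assumed to wrap the dataset's 'learningData' resource as a DataFrame, much as DatasetToDataFramePrimitive does in the other test files here.) -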
resource = test_utils.get_dataframe(dataset) - - # set dataframe type to int to match output of a prior parse columns step - resource.iloc[:,3] = resource.iloc[:,3].astype(int) - - filter_hyperparams_class = regex_filter.RegexFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class({ - 'column': 3, - 'inclusive': False, - 'regex': '1990' - }) - - filter_primitive = regex_filter.RegexFilterPrimitive(hyperparams=hp) - new_df = filter_primitive.produce(inputs=resource).value - - matches = new_df[~new_df['year'].astype(str).str.match('1990')] - self.assertTrue(set(matches['year'].unique()) == set([2000, 2010])) - - def test_row_metadata_removal(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # add metadata for rows 0 and 1 - dataset.metadata = dataset.metadata.update(('learningData', 1), {'a': 0}) - dataset.metadata = dataset.metadata.update(('learningData', 2), {'b': 1}) - - resource = test_utils.get_dataframe(dataset) - - filter_hyperparams_class = regex_filter.RegexFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class({ - 'column': 1, - 'inclusive': False, - 'regex': 'AAA' - }) - - filter_primitive = regex_filter.RegexFilterPrimitive(hyperparams=hp) - new_df = filter_primitive.produce(inputs=resource).value - - # verify that the length is correct - self.assertEqual(len(new_df), new_df.metadata.query(())['dimension']['length']) - - # verify that the rows were re-indexed in the metadata - self.assertEqual(new_df.metadata.query((0,))['a'], 0) - self.assertEqual(new_df.metadata.query((1,))['b'], 1) - self.assertFalse('b' in new_df.metadata.query((2,))) - - def test_bad_regex(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - resource = test_utils.get_dataframe(dataset) - - filter_hyperparams_class = regex_filter.RegexFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class({ - 'column': 1, - 'inclusive': True, - 'regex': '[' - }) - - filter_primitive = regex_filter.RegexFilterPrimitive(hyperparams=hp) - with self.assertRaises(exceptions.InvalidArgumentValueError): - filter_primitive.produce(inputs=resource) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_remove_duplicate_columns.py b/common-primitives/tests/test_remove_duplicate_columns.py deleted file mode 100644 index 3713751..0000000 --- a/common-primitives/tests/test_remove_duplicate_columns.py +++ /dev/null @@ -1,123 +0,0 @@ -import unittest - -from d3m import container, utils -from d3m.metadata import base as metadata_base - -from common_primitives import remove_duplicate_columns - - -class RemoveDuplicateColumnsPrimitiveTestCase(unittest.TestCase): - def test_basic(self): - main = container.DataFrame({'a1': [1, 2, 3], 'b1': [4, 5, 6], 'a2': [1, 2, 3], 'c1': [7, 8, 9], 'a3': [1, 2, 3], 'a1a': [1, 2, 3]}, { - 'top_level': 'main', - }, columns=['a1', 'b1', 'a2', 'c1', 'a3', 'a1a'], generate_metadata=True) - main.metadata = main.metadata.update_column(0, {'name': 'aaa111'}) - main.metadata = main.metadata.update_column(1, {'name': 'bbb111'}) - main.metadata = main.metadata.update_column(2, {'name': 'aaa222'}) - main.metadata = 
main.metadata.update_column(3, {'name': 'ccc111'}) - main.metadata = main.metadata.update_column(4, {'name': 'aaa333'}) - main.metadata = main.metadata.update_column(5, {'name': 'aaa111'}) - - self.assertEqual(utils.to_json_structure(main.metadata.to_internal_simple_structure()), [{ - 'selector': [], - 'metadata': { - 'top_level': 'main', - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 3, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 6, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': {'structural_type': 'numpy.int64', 'name': 'aaa111'}, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': {'structural_type': 'numpy.int64', 'name': 'bbb111'}, - }, { - 'selector': ['__ALL_ELEMENTS__', 2], - 'metadata': {'structural_type': 'numpy.int64', 'name': 'aaa222'}, - }, { - 'selector': ['__ALL_ELEMENTS__', 3], - 'metadata': {'structural_type': 'numpy.int64', 'name': 'ccc111'}, - }, { - 'selector': ['__ALL_ELEMENTS__', 4], - 'metadata': {'structural_type': 'numpy.int64', 'name': 'aaa333'}, - }, { - 'selector': ['__ALL_ELEMENTS__', 5], - 'metadata': {'structural_type': 'numpy.int64', 'name': 'aaa111'}, - }]) - - hyperparams_class = remove_duplicate_columns.RemoveDuplicateColumnsPrimitive.metadata.get_hyperparams() - primitive = remove_duplicate_columns.RemoveDuplicateColumnsPrimitive(hyperparams=hyperparams_class.defaults()) - primitive.set_training_data(inputs=main) - primitive.fit() - new_main = primitive.produce(inputs=main).value - - self.assertEqual(new_main.values.tolist(), [ - [1, 4, 7], - [2, 5, 8], - [3, 6, 9], - ]) - - self.assertEqual(utils.to_json_structure(new_main.metadata.to_internal_simple_structure()), [{ - 'selector': [], - 'metadata': { - 'top_level': 'main', - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 3, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 3, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'aaa111', - 'other_names': ['aaa222', 'aaa333'], - 'structural_type': 'numpy.int64', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'name': 'bbb111', - 'structural_type': 'numpy.int64', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 2], - 'metadata': { - 'name': 'ccc111', - 'structural_type': 'numpy.int64', - }, - }]) - - params = primitive.get_params() - primitive.set_params(params=params) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_rename_duplicate_columns.py b/common-primitives/tests/test_rename_duplicate_columns.py deleted file mode 100644 index 90cc522..0000000 --- a/common-primitives/tests/test_rename_duplicate_columns.py +++ /dev/null @@ -1,136 +0,0 @@ -import os -import unittest - -import pandas as pd - -from 
d3m import container -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe, column_parser, rename_duplicate_columns - - -class RenameDuplicateColumnsPrimitiveTestCase(unittest.TestCase): - def _get_iris(self): - dataset_doc_path = os.path.abspath( - os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - hyperparams_class = \ - dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - - dataframe = primitive.produce(inputs=dataset).value - - return dataframe - - def _get_iris_columns(self): - dataframe = self._get_iris() - # We set semantic types like runtime would. - dataframe.metadata = dataframe.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 5), - 'https://metadata.datadrivendiscovery.org/types/Target') - dataframe.metadata = dataframe.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 5), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataframe.metadata = dataframe.metadata.remove_semantic_type((metadata_base.ALL_ELEMENTS, 5), - 'https://metadata.datadrivendiscovery.org/types/Attribute') - - # Parsing. - hyperparams_class = \ - column_parser.ColumnParserPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = column_parser.ColumnParserPrimitive(hyperparams=hyperparams_class.defaults()) - dataframe = primitive.produce(inputs=dataframe).value - - return dataframe - - def test_basic(self): - test_data_inputs = {'col1': [1.0, 2.0, 3.0], - 'col2': [4.0, 5.0, 6.0], - 'col3': [100, 200, 300]} - dataframe_inputs = container.DataFrame.from_dict(data=test_data_inputs) - test_data_inputs_dup = {'col1': [1.0, 2.0, 3.0], - 'col2': [4.0, 5.0, 6.0]} - dataframe_inputs_dup = container.DataFrame.from_dict(data=test_data_inputs_dup) - test_data_inputs_dup_2 = {'col1': [1.0, 2.0, 3.0], - 'col2': [4.0, 5.0, 6.0], - 'col3': [100, 200, 300]} - dataframe_inputs_dup_2 = container.DataFrame.from_dict(data=test_data_inputs_dup_2) - input = pd.concat([dataframe_inputs, dataframe_inputs_dup, dataframe_inputs_dup_2], axis=1) - - hyperparams_class = rename_duplicate_columns.RenameDuplicateColumnsPrimitive.metadata.query()['primitive_code'][ - 'class_type_arguments']['Hyperparams'] - - primitive = rename_duplicate_columns.RenameDuplicateColumnsPrimitive(hyperparams=hyperparams_class.defaults()) - - call_result = primitive.produce(inputs=input) - dataframe_renamed = call_result.value - self.assertEqual(dataframe_renamed.columns.values.tolist(), - ['col1', 'col2', 'col3', 'col1.1', 'col2.1', 'col1.2', 'col2.2', 'col3.1']) - - def test_monotonic_dup_col_name(self): - """This test is added because of issue #73""" - test_data_inputs = {'a': [1.0, 2.0, 3.0], - 'b': [100, 200, 300]} - dataframe_inputs = container.DataFrame.from_dict(data=test_data_inputs) - test_data_inputs_dup = {'b': [1.0, 2.0, 3.0], - 'c': [4.0, 5.0, 6.0]} - dataframe_inputs_dup = container.DataFrame.from_dict(data=test_data_inputs_dup) - input = pd.concat([dataframe_inputs, dataframe_inputs_dup], axis=1) - - hyperparams_class = rename_duplicate_columns.RenameDuplicateColumnsPrimitive.metadata.query()['primitive_code'][ - 'class_type_arguments']['Hyperparams'] - - primitive = 
rename_duplicate_columns.RenameDuplicateColumnsPrimitive(hyperparams=hyperparams_class.defaults()) - - call_result = primitive.produce(inputs=input) - dataframe_renamed = call_result.value - self.assertEqual(dataframe_renamed.columns.values.tolist(), - ['a', 'b', 'b.1', 'c']) - - def test_no_change(self): - test_data_inputs = {'col0': [1.0, 2.0, 3.0], - 'col1': [4.0, 5.0, 6.0], - 'col2': [100, 200, 300]} - dataframe_inputs = container.DataFrame.from_dict(data=test_data_inputs) - test_data_inputs = {'col3': [1.0, 2.0, 3.0], - 'col4': [4.0, 5.0, 6.0], - 'col5': [100, 200, 300]} - dataframe_inputs_2 = container.DataFrame.from_dict(data=test_data_inputs) - - inputs = pd.concat([dataframe_inputs, dataframe_inputs_2], axis=1) - hyperparams_class = rename_duplicate_columns.RenameDuplicateColumnsPrimitive.metadata.query()['primitive_code'][ - 'class_type_arguments']['Hyperparams'] - - primitive = rename_duplicate_columns.RenameDuplicateColumnsPrimitive(hyperparams=hyperparams_class.defaults()) - - call_result = primitive.produce(inputs=inputs) - dataframe_renamed = call_result.value - - self.assertEqual(dataframe_renamed.columns.values.tolist(), - ['col0', 'col1', 'col2', 'col3', 'col4', 'col5']) - - def test_iris_with_metadata(self): - dataframe = self._get_iris_columns() - dataframe_1 = self._get_iris_columns() - dataframe_concated = dataframe.append_columns(dataframe_1) - dataframe_concated_bk = dataframe_concated.copy() - hyperparams_class = rename_duplicate_columns.RenameDuplicateColumnsPrimitive.metadata.query()['primitive_code'][ - 'class_type_arguments']['Hyperparams'] - - primitive = rename_duplicate_columns.RenameDuplicateColumnsPrimitive(hyperparams=hyperparams_class.defaults()) - - call_result = primitive.produce(inputs=dataframe_concated) - dataframe_renamed = call_result.value - names = ['d3mIndex', 'sepalLength', 'sepalWidth', 'petalLength', 'petalWidth', 'species', - 'd3mIndex.1', 'sepalLength.1', 'sepalWidth.1', 'petalLength.1', 'petalWidth.1', - 'species.1'] - self.assertEqual(dataframe_renamed.columns.values.tolist(), names) - self.assertTrue(dataframe_concated.equals(dataframe_concated_bk)) - self.assertEqual(dataframe_concated.metadata.to_internal_json_structure(), - dataframe_concated_bk.metadata.to_internal_json_structure()) - - for i, column_name in enumerate(dataframe_renamed.columns): - self.assertEqual(dataframe_renamed.metadata.query_column(i)['other_name'], - column_name.split(primitive.hyperparams['separator'])[0]) - self.assertEqual(dataframe_renamed.metadata.query_column(i)['name'], names[i]) diff --git a/common-primitives/tests/test_replace_semantic_types.py b/common-primitives/tests/test_replace_semantic_types.py deleted file mode 100644 index 258167a..0000000 --- a/common-primitives/tests/test_replace_semantic_types.py +++ /dev/null @@ -1,97 +0,0 @@ -import os -import unittest - -from d3m import container, utils -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe, replace_semantic_types - -import utils as test_utils - - -class ReplaceSemanticTypesPrimitiveTestCase(unittest.TestCase): - def _get_iris_dataframe(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - - primitive = 
dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - - call_metadata = primitive.produce(inputs=dataset) - - dataframe = call_metadata.value - - return dataframe - - def test_basic(self): - dataframe = self._get_iris_dataframe() - - hyperparams_class = replace_semantic_types.ReplaceSemanticTypesPrimitive.metadata.get_hyperparams() - primitive = replace_semantic_types.ReplaceSemanticTypesPrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'from_semantic_types': ('https://metadata.datadrivendiscovery.org/types/SuggestedTarget',), - 'to_semantic_types': ('https://metadata.datadrivendiscovery.org/types/Attribute',), - })) - - outputs = primitive.produce(inputs=dataframe).value - - self._test_metadata(outputs.metadata) - - def _test_metadata(self, metadata): - self.maxDiff = None - - self.assertEqual(test_utils.convert_through_json(metadata.query(())), { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/Table', - ], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - } - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS,))), { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 6, - } - }) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, 0))), { - 'name': 'd3mIndex', - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - }) - - for i in range(1, 5): - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, i))), { - 'name': ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth'][i - 1], - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }, i) - - self.assertEqual(test_utils.convert_through_json(metadata.query((metadata_base.ALL_ELEMENTS, 5))), { - 'name': 'species', - 'structural_type': 'str', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }) - - self.assertTrue(metadata.get_elements((metadata_base.ALL_ELEMENTS,)) in [[0, 1, 2, 3, 4, 5], [metadata_base.ALL_ELEMENTS, 0, 1, 2, 3, 4, 5]]) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_simple_profiler.py b/common-primitives/tests/test_simple_profiler.py deleted file mode 100644 index b9a6706..0000000 --- a/common-primitives/tests/test_simple_profiler.py +++ /dev/null @@ -1,446 +0,0 @@ -import os.path -import pickle -import unittest - -from d3m import container -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe, simple_profiler, train_score_split - - -class SimpleProfilerPrimitiveTestCase(unittest.TestCase): - def _get_iris(self, set_target_as_categorical): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - original_metadata = dataset.metadata - - # We make a very empty metadata. 
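- # ("generate" infers only structural metadata from the container itself, so every semantic type the profiler is expected to recover is dropped here and only a minimal set is added back below.)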
- dataset.metadata = metadata_base.DataMetadata().generate(dataset) - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 0), 'http://schema.org/Integer') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 0), 'https://metadata.datadrivendiscovery.org/types/PrimaryKey') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Attribute') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget') - - if set_target_as_categorical: - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/CategoricalData') - else: - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/UnknownType') - - return dataset, original_metadata - - def _test_metadata(self, original_metadata, dataframe_metadata, set_target_as_categorical): - for column_index in range(5): - self.assertCountEqual(original_metadata.query_column_field(column_index, 'semantic_types', at=('learningData',)), dataframe_metadata.query_column_field(column_index, 'semantic_types'), (set_target_as_categorical, column_index)) - - self.assertEqual(dataframe_metadata.query_column_field(5, 'semantic_types'), ( - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/TrueTarget', - ), set_target_as_categorical) - - def test_basic(self): - for set_target_as_categorical in [False, True]: - dataset, original_metadata = self._get_iris(set_target_as_categorical) - - hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - - dataframe = primitive.produce(inputs=dataset).value - - hyperparams_class = simple_profiler.SimpleProfilerPrimitive.metadata.get_hyperparams() - - primitive = simple_profiler.SimpleProfilerPrimitive(hyperparams=hyperparams_class.defaults()) - - primitive.set_training_data(inputs=dataframe) - primitive.fit() - - primitive_pickled = pickle.dumps(primitive) - primitive = pickle.loads(primitive_pickled) - - dataframe = primitive.produce(inputs=dataframe).value - - self._test_metadata(original_metadata, dataframe.metadata, set_target_as_categorical) - - def test_small_test(self): - for set_target_as_categorical in [False, True]: - dataset, original_metadata = self._get_iris(set_target_as_categorical) - - hyperparams_class = train_score_split.TrainScoreDatasetSplitPrimitive.metadata.get_hyperparams() - - primitive = train_score_split.TrainScoreDatasetSplitPrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'train_score_ratio': 0.9, - 'shuffle': True, - })) - - primitive.set_training_data(dataset=dataset) - primitive.fit() - - results = primitive.produce(inputs=container.List([0], generate_metadata=True)).value - - self.assertEqual(len(results), 1) - - train_dataset = results[0] - - self.assertEqual(len(train_dataset['learningData']), 135) - - results = primitive.produce_score_data(inputs=container.List([0], 
generate_metadata=True)).value - - self.assertEqual(len(results), 1) - - score_dataset = results[0] - - self.assertEqual(len(score_dataset['learningData']), 15) - - hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - - train_dataframe = primitive.produce(inputs=train_dataset).value - - score_dataframe = primitive.produce(inputs=score_dataset).value - - hyperparams_class = simple_profiler.SimpleProfilerPrimitive.metadata.get_hyperparams() - - primitive = simple_profiler.SimpleProfilerPrimitive(hyperparams=hyperparams_class.defaults()) - - primitive.set_training_data(inputs=train_dataframe) - primitive.fit() - dataframe = primitive.produce(inputs=score_dataframe).value - - self._test_metadata(original_metadata, dataframe.metadata, set_target_as_categorical) - - def _get_column_semantic_types(self, dataframe): - number_of_columns = dataframe.metadata.query((metadata_base.ALL_ELEMENTS,))['dimension']['length'] - generated_semantic_types = [ - dataframe.metadata.query((metadata_base.ALL_ELEMENTS, i))['semantic_types'] - for i in range(number_of_columns) - ] - generated_semantic_types = [sorted(x) for x in generated_semantic_types] - - return generated_semantic_types - - def test_iris_csv(self): - dataset_doc_path = os.path.abspath( - os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'tables', 'learningData.csv') - ) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # Use profiler to assign semantic types - dataframe = self._profile_dataset(dataset=dataset) - - generated_semantic_types = self._get_column_semantic_types(dataframe) - - semantic_types = [ - [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - [ - 'https://metadata.datadrivendiscovery.org/types/Attribute', - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - ], - ] - - self.assertEqual(generated_semantic_types, semantic_types) - - def _profile_dataset(self, dataset, hyperparams=None): - if hyperparams is None: - hyperparams = {} - - hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - dataframe = primitive.produce(inputs=dataset).value - - hyperparams_class = simple_profiler.SimpleProfilerPrimitive.metadata.get_hyperparams() - primitive = simple_profiler.SimpleProfilerPrimitive(hyperparams=hyperparams_class.defaults().replace(hyperparams)) - primitive.set_training_data(inputs=dataframe) - primitive.fit() - - return primitive.produce(inputs=dataframe).value - - def test_boston(self): - dataset = container.dataset.Dataset.load('sklearn://boston') - - # Use profiler to assign semantic types - dataframe = self._profile_dataset(dataset=dataset) - - generated_semantic_types = self._get_column_semantic_types(dataframe) - - semantic_types = [ - ['http://schema.org/Integer', 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], 
- ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Boolean', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - [ - 'https://metadata.datadrivendiscovery.org/types/Attribute', - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - ], - ['http://schema.org/Integer', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/TrueTarget', - ], - ] - - self.assertEqual(generated_semantic_types, semantic_types) - - def test_diabetes(self): - dataset = container.dataset.Dataset.load('sklearn://diabetes') - - # Use profiler to assign semantic types - dataframe = self._profile_dataset(dataset=dataset) - - generated_semantic_types = self._get_column_semantic_types(dataframe) - - semantic_types = [ - ['http://schema.org/Integer', 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Float', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/TrueTarget', - ], - ] - - self.assertEqual(generated_semantic_types, semantic_types) - - def test_digits(self): - self.maxDiff = None - - dataset = container.dataset.Dataset.load('sklearn://digits') - - detect_semantic_types = list(simple_profiler.SimpleProfilerPrimitive.metadata.get_hyperparams().configuration['detect_semantic_types'].get_default()) - # Some pixels have very little different values. 
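- # (a pixel column with only a couple of distinct intensity values would otherwise be detected as Boolean, so Boolean detection is disabled)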
- detect_semantic_types.remove('http://schema.org/Boolean') - # There are just 16 colors, but we want to see them as integers. - detect_semantic_types.remove('https://metadata.datadrivendiscovery.org/types/CategoricalData') - - # Use profiler to assign semantic types - dataframe = self._profile_dataset(dataset=dataset, hyperparams={ - 'detect_semantic_types': detect_semantic_types, - }) - - generated_semantic_types = self._get_column_semantic_types(dataframe) - - semantic_types = ( - [['http://schema.org/Integer', 'https://metadata.datadrivendiscovery.org/types/PrimaryKey']] - + 64 - * [ - [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ] - ] - + [ - [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/TrueTarget', - ] - ] - ) - - self.assertEqual(generated_semantic_types, semantic_types) - - def test_iris(self): - dataset = container.dataset.Dataset.load('sklearn://iris') - - # Use profiler to assign semantic types - dataframe = self._profile_dataset(dataset=dataset) - - generated_semantic_types = self._get_column_semantic_types(dataframe) - - semantic_types = [ - ['http://schema.org/Integer', 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/TrueTarget', - ], - ] - - self.assertEqual(generated_semantic_types, semantic_types) - - def test_breast_cancer(self): - dataset = container.dataset.Dataset.load('sklearn://breast_cancer') - - # Use profiler to assign semantic types - dataframe = self._profile_dataset(dataset=dataset) - - generated_semantic_types = self._get_column_semantic_types(dataframe) - - semantic_types = ( - [['http://schema.org/Integer', 'https://metadata.datadrivendiscovery.org/types/PrimaryKey']] - + 30 - * [ - [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ] - ] - + [ - [ - 'http://schema.org/Boolean', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/TrueTarget', - ] - ] - ) - - self.assertEqual(generated_semantic_types, semantic_types) - - def test_linnerud(self): - dataset = container.dataset.Dataset.load('sklearn://linnerud') - - # Use profiler to assign semantic types - dataframe = self._profile_dataset(dataset=dataset) - - generated_semantic_types = self._get_column_semantic_types(dataframe) - - semantic_types = [ - ['http://schema.org/Integer', 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - ['http://schema.org/Integer', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - ['http://schema.org/Integer', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 
['http://schema.org/Integer', 'https://metadata.datadrivendiscovery.org/types/Attribute'], - [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - # Only the first "SuggestedTarget" column is made into a target. - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/TrueTarget', - ], - [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - ], - [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - ], - ] - - self.assertEqual(generated_semantic_types, semantic_types) - - def test_wine(self): - dataset = container.dataset.Dataset.load('sklearn://wine') - - # Use profiler to assign semantic types - dataframe = self._profile_dataset(dataset=dataset) - - generated_semantic_types = self._get_column_semantic_types(dataframe) - - semantic_types = [ - ['http://schema.org/Integer', 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/TrueTarget', - ], - ] - - self.assertEqual(generated_semantic_types, semantic_types) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_stack_ndarray_column.py b/common-primitives/tests/test_stack_ndarray_column.py deleted file mode 100644 index d6b3b1d..0000000 --- a/common-primitives/tests/test_stack_ndarray_column.py +++ /dev/null @@ -1,77 +0,0 @@ -import unittest - -from d3m import container, utils -from d3m.metadata import base as metadata_base - -from common_primitives import stack_ndarray_column - - -class StackNDArrayColumnPrimitiveTestCase(unittest.TestCase): - def _get_data(self): - data = container.DataFrame({ - 'a': [1, 2, 3], - 'b': [container.ndarray([2, 3, 4]), container.ndarray([5, 6, 7]), container.ndarray([8, 9, 10])] - }, { - 'top_level': 'foobar1', - }, 
generate_metadata=True) - - data.metadata = data.metadata.update_column(1, { - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - }) - - return data - - def test_basic(self): - data = self._get_data() - - data_metadata_before = data.metadata.to_internal_json_structure() - - stack_hyperparams_class = stack_ndarray_column.StackNDArrayColumnPrimitive.metadata.get_hyperparams() - stack_primitive = stack_ndarray_column.StackNDArrayColumnPrimitive(hyperparams=stack_hyperparams_class.defaults()) - stack_array = stack_primitive.produce(inputs=data).value - - self.assertEqual(stack_array.shape, (3, 3)) - - self._test_metadata(stack_array.metadata) - - self.assertEqual(data.metadata.to_internal_json_structure(), data_metadata_before) - - def _test_metadata(self, metadata): - self.maxDiff = None - - self.assertEqual(utils.to_json_structure(metadata.to_internal_simple_structure()), [{ - 'selector': [], - 'metadata': { - 'top_level': 'foobar1', - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.numpy.ndarray', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 3, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'length': 3, - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - }, - # It is unclear if name and semantic types should be moved to rows, but this is what currently happens. - 'name': 'b', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': '__NO_VALUE__', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', '__ALL_ELEMENTS__'], - 'metadata': { - 'structural_type': 'numpy.int64', - }, - }]) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_tabular_extractor.py b/common-primitives/tests/test_tabular_extractor.py deleted file mode 100644 index 29b2905..0000000 --- a/common-primitives/tests/test_tabular_extractor.py +++ /dev/null @@ -1,173 +0,0 @@ -import os -import unittest - -from d3m import container, utils -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe, column_parser, tabular_extractor - -import utils as test_utils - - -class TabularExtractorPrimitiveTestCase(unittest.TestCase): - def setUp(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We mark targets as attributes. 
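-        # (Explanatory sketch, assuming standard D3M semantic-type handling:
-        # stripping "Target"/"TrueTarget" from column 5 and adding "Attribute"
-        # makes the label column look like an ordinary input, so the extractor
-        # under test encodes it like any other categorical attribute.)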
-        dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Target')
-        dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/TrueTarget')
-        dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Attribute')
-
-        self.dataset = dataset
-
-        # DatasetToDataFramePrimitive
-
-        df_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams()
-
-        df_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=df_hyperparams_class.defaults())
-
-        df_dataframe = df_primitive.produce(inputs=self.dataset).value
-
-        # Set some missing values.
-        df_dataframe.iloc[1, 1] = ""
-        df_dataframe.iloc[10, 1] = ""
-        df_dataframe.iloc[15, 1] = ""
-
-        # ColumnParserPrimitive
-
-        cp_hyperparams_class = column_parser.ColumnParserPrimitive.metadata.get_hyperparams()
-
-        # To simulate how Pandas' "read_csv" reads CSV files, we parse just numbers.
-        cp_primitive = column_parser.ColumnParserPrimitive(
-            hyperparams=cp_hyperparams_class.defaults().replace({
-                'parse_semantic_types': ['http://schema.org/Integer', 'http://schema.org/Float'],
-            }),
-        )
-
-        self.dataframe = cp_primitive.produce(inputs=df_dataframe).value
-
-    def test_defaults(self):
-        te_hyperparams_class = tabular_extractor.AnnotatedTabularExtractorPrimitive.metadata.get_hyperparams()
-
-        # The primitive one-hot encodes categorical columns, imputes numerical values,
-        # and adds a missing-value indicator column for each.
-        te_primitive = tabular_extractor.AnnotatedTabularExtractorPrimitive(
-            hyperparams=te_hyperparams_class.defaults(),
-        )
-
-        te_primitive.set_training_data(inputs=self.dataframe)
-        te_primitive.fit()
-
-        dataframe = te_primitive.produce(inputs=self.dataframe).value
-
-        # 1 index column, 4 numerical columns with one indicator column each,
-        # and 3 one-hot columns for the "target" column plus an indicator column for it.
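-        # A quick check of the arithmetic (a sketch, assuming the defaults above):
-        # 1 + 4 * 2 + (3 + 1) == 13 columns.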
- self.assertEqual(dataframe.shape, (150, 13)) - - self.assertEqual(test_utils.convert_through_json(utils.to_json_structure(dataframe.metadata.to_internal_simple_structure())), [{ - 'selector': [], - 'metadata': { - 'dimension': { - 'length': 150, - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'structural_type': 'd3m.container.pandas.DataFrame', - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'length': 13, - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'd3mIndex', - 'semantic_types': ['http://schema.org/Integer', 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - 'structural_type': 'int', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.float64', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 2], - 'metadata': { - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.float64', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 3], - 'metadata': { - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.float64', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 4], - 'metadata': { - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.float64', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 5], - 'metadata': { - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.float64', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 6], - 'metadata': { - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.float64', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 7], - 'metadata': { - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.float64', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 8], - 'metadata': { - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.float64', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 9], - 'metadata': { - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.float64', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 10], - 'metadata': { - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.float64', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 11], - 'metadata': { - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.float64', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 12], - 'metadata': { - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Attribute'], - 'structural_type': 'numpy.float64', - }, - }]) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_term_filter.py b/common-primitives/tests/test_term_filter.py deleted file mode 100644 index 5131238..0000000 --- a/common-primitives/tests/test_term_filter.py +++ /dev/null @@ -1,136 +0,0 @@ 
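- # (Explanatory note on the tests below: TermFilterPrimitive keeps or drops rows
- # depending on whether a column's values match a list of terms; "inclusive"
- # selects keep-versus-drop and "match_whole" selects exact versus substring
- # matching, which is what the individual test methods exercise.)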
-import unittest -import os - -from common_primitives import term_filter -from d3m import container - -import utils as test_utils - - -class TermFilterPrimitiveTestCase(unittest.TestCase): - def test_inclusive(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - resource = test_utils.get_dataframe(dataset) - - filter_hyperparams_class = term_filter.TermFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class({ - 'column': 1, - 'inclusive': True, - 'terms': ['AAA', 'CCC'], - 'match_whole': True - }) - - filter_primitive = term_filter.TermFilterPrimitive(hyperparams=hp) - new_df = filter_primitive.produce(inputs=resource).value - - self.assertTrue(set(new_df['code'].unique()) == set(['AAA', 'CCC'])) - - def test_exclusive(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - resource = test_utils.get_dataframe(dataset) - - filter_hyperparams_class = term_filter.TermFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class({ - 'column': 1, - 'inclusive': False, - 'terms': ['AAA', 'CCC'], - 'match_whole': True - }) - - filter_primitive = term_filter.TermFilterPrimitive(hyperparams=hp) - new_df = filter_primitive.produce(inputs=resource).value - - self.assertTrue(set(new_df['code'].unique()) == set(['BBB'])) - - def test_numeric(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - resource = test_utils.get_dataframe(dataset) - - # set dataframe type to int to match output of a prior parse columns step - resource.iloc[:,3] = resource.iloc[:,3].astype(int) - - filter_hyperparams_class = term_filter.TermFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class({ - 'column': 3, - 'inclusive': False, - 'terms': ['1990'], - 'match_whole': True - }) - - filter_primitive = term_filter.TermFilterPrimitive(hyperparams=hp) - new_df = filter_primitive.produce(inputs=resource).value - - matches = new_df[~new_df['year'].astype(str).str.match('1990')] - self.assertTrue(set(matches['year'].unique()) == set([2000, 2010])) - - def test_partial_no_match(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - resource = test_utils.get_dataframe(dataset) - - filter_hyperparams_class = term_filter.TermFilterPrimitive.metadata.get_hyperparams() - hp = filter_hyperparams_class({ - 'column': 1, - 'inclusive': True, - 'terms': ['AA', 'CC'], - 'match_whole': False - }) - - filter_primitive = term_filter.TermFilterPrimitive(hyperparams=hp) - new_df = filter_primitive.produce(inputs=resource).value - - self.assertTrue(set(new_df['code'].unique()) == set(['AAA', 'CCC'])) - - def test_escaped_regex(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json')) - dataset = 
container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path))
-        resource = test_utils.get_dataframe(dataset)
-
-        filter_hyperparams_class = term_filter.TermFilterPrimitive.metadata.get_hyperparams()
-        hp = filter_hyperparams_class({
-            'column': 4,
-            'inclusive': True,
-            'terms': ['40.2'],
-            'match_whole': False
-        })
-
-        filter_primitive = term_filter.TermFilterPrimitive(hyperparams=hp)
-        new_df = filter_primitive.produce(inputs=resource).value
-
-        self.assertListEqual(list(new_df['value']), ['40.2346487255306'])
-
-    def test_row_metadata_removal(self):
-        dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'database_dataset_1', 'datasetDoc.json'))
-        dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path))
-
-        # add metadata for rows 1 and 2
-        dataset.metadata = dataset.metadata.update(('learningData', 1), {'a': 0})
-        dataset.metadata = dataset.metadata.update(('learningData', 2), {'b': 1})
-
-        resource = test_utils.get_dataframe(dataset)
-
-        filter_hyperparams_class = term_filter.TermFilterPrimitive.metadata.get_hyperparams()
-        hp = filter_hyperparams_class({
-            'column': 1,
-            'inclusive': False,
-            'terms': ['AAA'],
-            'match_whole': True
-        })
-
-        filter_primitive = term_filter.TermFilterPrimitive(hyperparams=hp)
-        new_df = filter_primitive.produce(inputs=resource).value
-
-        # verify that the length is correct
-        self.assertEqual(len(new_df), new_df.metadata.query(())['dimension']['length'])
-
-        # verify that the rows were re-indexed in the metadata
-        self.assertEqual(new_df.metadata.query((0,))['a'], 0)
-        self.assertEqual(new_df.metadata.query((1,))['b'], 1)
-        self.assertNotIn('b', new_df.metadata.query((2,)))
-
-
-if __name__ == '__main__':
-    unittest.main()
diff --git a/common-primitives/tests/test_text_reader.py b/common-primitives/tests/test_text_reader.py
deleted file mode 100644
index 00335be..0000000
--- a/common-primitives/tests/test_text_reader.py
+++ /dev/null
@@ -1,30 +0,0 @@
-import unittest
-import os
-
-from d3m import container
-
-from common_primitives import dataset_to_dataframe, text_reader
-
-
-class TextReaderPrimitiveTestCase(unittest.TestCase):
-    def test_basic(self):
-        dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'text_dataset_1', 'datasetDoc.json'))
-
-        dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path))
-
-        dataframe_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams()
-        dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataframe_hyperparams_class.defaults().replace({'dataframe_resource': '0'}))
-        dataframe = dataframe_primitive.produce(inputs=dataset).value
-
-        text_hyperparams_class = text_reader.TextReaderPrimitive.metadata.get_hyperparams()
-        text_primitive = text_reader.TextReaderPrimitive(hyperparams=text_hyperparams_class.defaults().replace({'return_result': 'replace'}))
-        tables = text_primitive.produce(inputs=dataframe).value
-
-        self.assertEqual(tables.shape, (4, 1))
-
-        self.assertEqual(tables.metadata.query_column(0)['structural_type'], str)
-        self.assertEqual(tables.metadata.query_column(0)['semantic_types'], ('https://metadata.datadrivendiscovery.org/types/PrimaryKey', 'http://schema.org/Text'))
-
-
-if __name__ == '__main__':
-    unittest.main()
diff --git a/common-primitives/tests/test_train_score_split.py
b/common-primitives/tests/test_train_score_split.py deleted file mode 100644 index 317367a..0000000 --- a/common-primitives/tests/test_train_score_split.py +++ /dev/null @@ -1,88 +0,0 @@ -import os -import pickle -import unittest - -from d3m import container -from d3m.metadata import base as metadata_base - -from common_primitives import train_score_split - - -class TrainScoreDatasetSplitPrimitiveTestCase(unittest.TestCase): - def test_produce_train(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'tests', 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Target') - dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - hyperparams_class = train_score_split.TrainScoreDatasetSplitPrimitive.metadata.get_hyperparams() - - primitive = train_score_split.TrainScoreDatasetSplitPrimitive(hyperparams=hyperparams_class.defaults().replace({ - 'shuffle': True, - })) - - primitive.set_training_data(dataset=dataset) - primitive.fit() - - # To test that pickling works. - pickle.dumps(primitive) - - results = primitive.produce(inputs=container.List([0], generate_metadata=True)).value - - self.assertEqual(len(results), 1) - - for dataset in results: - self.assertEqual(len(dataset), 1) - - self.assertEqual(results[0]['learningData'].shape[0], 112) - self.assertEqual(list(results[0]['learningData'].iloc[:, 0]), [ - '0', '1', '2', '3', '4', '5', '6', '9', '10', '11', '12', '13', '14', '15', '17', '19', '20', - '21', '23', '25', '28', '29', '30', '31', '32', '34', '35', '36', '38', '39', '41', '42', '43', - '46', '47', '48', '49', '50', '52', '53', '55', '56', '57', '58', '60', '61', '64', '65', '67', - '68', '69', '70', '72', '74', '75', '77', '79', '80', '81', '82', '85', '87', '88', '89', '91', - '92', '94', '95', '96', '98', '99', '101', '102', '103', '104', '105', '106', '108', '109', '110', - '111', '112', '113', '115', '116', '117', '118', '119', '120', '122', '123', '124', '125', '128', - '129', '130', '131', '133', '135', '136', '138', '139', '140', '141', '142', '143', '144', '145', - '146', '147', '148', '149', - ]) - - def test_produce_score(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'tests', 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - # We set semantic types like runtime would. 
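-        # (Explanatory note, assuming the usual D3M runtime behavior: before a
-        # pipeline runs, the runtime marks the problem's target column with
-        # "Target"/"TrueTarget" and drops its "Attribute" marking; the calls
-        # below reproduce that annotation by hand.)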
-        dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Target')
-        dataset.metadata = dataset.metadata.add_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/TrueTarget')
-        dataset.metadata = dataset.metadata.remove_semantic_type(('learningData', metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Attribute')
-
-        hyperparams_class = train_score_split.TrainScoreDatasetSplitPrimitive.metadata.get_hyperparams()
-
-        primitive = train_score_split.TrainScoreDatasetSplitPrimitive(hyperparams=hyperparams_class.defaults().replace({
-            'shuffle': True,
-        }))
-
-        primitive.set_training_data(dataset=dataset)
-        primitive.fit()
-
-        results = primitive.produce_score_data(inputs=container.List([0], generate_metadata=True)).value
-
-        self.assertEqual(len(results), 1)
-
-        for dataset in results:
-            self.assertEqual(len(dataset), 1)
-
-        self.assertEqual(results[0]['learningData'].shape[0], 38)
-        self.assertEqual(list(results[0]['learningData'].iloc[:, 0]), [
-            '7', '8', '16', '18', '22', '24', '26', '27', '33', '37', '40', '44', '45', '51', '54',
-            '59', '62', '63', '66', '71', '73', '76', '78', '83', '84', '86', '90', '93', '97', '100',
-            '107', '114', '121', '126', '127', '132', '134', '137',
-        ])
-
-
-if __name__ == '__main__':
-    unittest.main()
diff --git a/common-primitives/tests/test_unseen_label_decoder.py b/common-primitives/tests/test_unseen_label_decoder.py
deleted file mode 100644
index 108a5c6..0000000
--- a/common-primitives/tests/test_unseen_label_decoder.py
+++ /dev/null
@@ -1,51 +0,0 @@
-import unittest
-
-from d3m import container
-
-from common_primitives import unseen_label_encoder, unseen_label_decoder
-
-
-class UnseenLabelDecoderTestCase(unittest.TestCase):
-    def test_basic(self):
-        encoder_hyperparams_class = unseen_label_encoder.UnseenLabelEncoderPrimitive.metadata.get_hyperparams()
-        encoder_primitive = unseen_label_encoder.UnseenLabelEncoderPrimitive(hyperparams=encoder_hyperparams_class.defaults())
-
-        inputs = container.DataFrame({
-            'value': [0.0, 1.0, 2.0, 3.0],
-            'number': [0, 1, 2, 3],
-            'word': ['one', 'two', 'three', 'four'],
-        }, generate_metadata=True)
-        inputs.metadata = inputs.metadata.update_column(2, {
-            'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData'],
-        })
-
-        encoder_primitive.set_training_data(inputs=inputs)
-        encoder_primitive.fit()
-
-        inputs = container.DataFrame({
-            'value': [1.0, 2.0, 3.0],
-            'number': [1, 2, 3],
-            'word': ['one', 'two', 'five'],
-        }, generate_metadata=True)
-        inputs.metadata = inputs.metadata.update_column(2, {
-            'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData'],
-        })
-
-        outputs = encoder_primitive.produce(inputs=inputs).value
-
-        decoder_hyperparams_class = unseen_label_decoder.UnseenLabelDecoderPrimitive.metadata.get_hyperparams()
-        decoder_primitive = unseen_label_decoder.UnseenLabelDecoderPrimitive(hyperparams=decoder_hyperparams_class.defaults().replace({'encoder': encoder_primitive}))
-
-        decoded = decoder_primitive.produce(inputs=outputs).value
-
-        self.assertEqual(decoded.values.tolist(), [
-            [1, 1.0, 'one'],
-            [2, 2.0, 'two'],
-            [3, 3.0, ''],
-        ])
-
-        self.assertEqual(decoded.metadata.query_column(2)['structural_type'], str)
-
-
-if __name__ == '__main__':
-    unittest.main()
diff --git a/common-primitives/tests/test_unseen_label_encoder.py
b/common-primitives/tests/test_unseen_label_encoder.py deleted file mode 100644 index 5057688..0000000 --- a/common-primitives/tests/test_unseen_label_encoder.py +++ /dev/null @@ -1,46 +0,0 @@ -import unittest - -from d3m import container - -from common_primitives import unseen_label_encoder - - -class UnseenLabelEncoderTestCase(unittest.TestCase): - def test_basic(self): - encoder_hyperparams_class = unseen_label_encoder.UnseenLabelEncoderPrimitive.metadata.get_hyperparams() - encoder_primitive = unseen_label_encoder.UnseenLabelEncoderPrimitive(hyperparams=encoder_hyperparams_class.defaults()) - - inputs = container.DataFrame({ - 'value': [0.0, 1.0, 2.0, 3.0], - 'number': [0, 1, 2, 3], - 'word': ['one', 'two', 'three', 'four'], - }, generate_metadata=True) - inputs.metadata = inputs.metadata.update_column(2, { - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData'], - }) - - encoder_primitive.set_training_data(inputs=inputs) - encoder_primitive.fit() - - inputs = container.DataFrame({ - 'value': [1.0, 2.0, 3.0], - 'number': [1, 2, 3], - 'word': ['one', 'two', 'five'], - }, generate_metadata=True) - inputs.metadata = inputs.metadata.update_column(2, { - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData'], - }) - - outputs = encoder_primitive.produce(inputs=inputs).value - - self.assertEqual(outputs.values.tolist(), [ - [1, 1.0, 1], - [2, 2.0, 2], - [3, 3.0, 0], - ]) - - self.assertEqual(outputs.metadata.query_column(2)['structural_type'], int) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_video_reader.py b/common-primitives/tests/test_video_reader.py deleted file mode 100644 index 4ae2f72..0000000 --- a/common-primitives/tests/test_video_reader.py +++ /dev/null @@ -1,35 +0,0 @@ -import unittest -import os - -from d3m import container - -from common_primitives import dataset_to_dataframe, video_reader - - -class VideoReaderPrimitiveTestCase(unittest.TestCase): - def test_basic(self): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'video_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - dataframe_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataframe_hyperparams_class.defaults().replace({'dataframe_resource': '0'})) - dataframe = dataframe_primitive.produce(inputs=dataset).value - - video_hyperparams_class = video_reader.VideoReaderPrimitive.metadata.get_hyperparams() - video_primitive = video_reader.VideoReaderPrimitive(hyperparams=video_hyperparams_class.defaults().replace({'return_result': 'replace'})) - videos = video_primitive.produce(inputs=dataframe).value - - self.assertEqual(videos.shape, (2, 1)) - self.assertEqual(videos.iloc[0, 0].shape, (408, 240, 320, 3)) - self.assertEqual(videos.iloc[1, 0].shape, (79, 240, 320, 3)) - - self._test_metadata(videos.metadata) - - def _test_metadata(self, metadata): - self.assertEqual(metadata.query_column(0)['structural_type'], container.ndarray) - self.assertEqual(metadata.query_column(0)['semantic_types'], ('https://metadata.datadrivendiscovery.org/types/PrimaryKey', 'http://schema.org/VideoObject')) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_xgboost_dart.py b/common-primitives/tests/test_xgboost_dart.py deleted 
file mode 100644 index a2928f4..0000000 --- a/common-primitives/tests/test_xgboost_dart.py +++ /dev/null @@ -1,687 +0,0 @@ -import os -import pickle -import unittest - -from d3m import container, utils -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe, extract_columns_semantic_types, xgboost_dart, column_parser - - -class XGBoostDartTestCase(unittest.TestCase): - def _get_iris(self): - dataset_doc_path = os.path.abspath( - os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - hyperparams_class = \ - dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - - dataframe = primitive.produce(inputs=dataset).value - - return dataframe - - def _get_iris_columns(self): - dataframe = self._get_iris() - - # We set custom metadata on columns. - for column_index in range(1, 5): - dataframe.metadata = dataframe.metadata.update_column(column_index, {'custom_metadata': 'attributes'}) - for column_index in range(5, 6): - dataframe.metadata = dataframe.metadata.update_column(column_index, {'custom_metadata': 'targets'}) - - # We set semantic types like runtime would. - dataframe.metadata = dataframe.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 5), - 'https://metadata.datadrivendiscovery.org/types/Target') - dataframe.metadata = dataframe.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 5), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataframe.metadata = dataframe.metadata.remove_semantic_type((metadata_base.ALL_ELEMENTS, 5), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - # Parsing. 
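-        # (Explanatory note: ColumnParserPrimitive converts the string columns
-        # loaded from the dataset into typed values (integers, floats, and so on)
-        # according to their semantic types, so the extraction steps below
-        # operate on parsed data.)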
- hyperparams_class = \ - column_parser.ColumnParserPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = column_parser.ColumnParserPrimitive(hyperparams=hyperparams_class.defaults()) - dataframe = primitive.produce(inputs=dataframe).value - - hyperparams_class = \ - extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive.metadata.query()['primitive_code'][ - 'class_type_arguments']['Hyperparams'] - - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive( - hyperparams=hyperparams_class.defaults().replace( - {'semantic_types': ('https://metadata.datadrivendiscovery.org/types/Attribute',)})) - attributes = primitive.produce(inputs=dataframe).value - - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive( - hyperparams=hyperparams_class.defaults().replace( - {'semantic_types': ('https://metadata.datadrivendiscovery.org/types/SuggestedTarget',)})) - targets = primitive.produce(inputs=dataframe).value - - return dataframe, attributes, targets - - def test_single_target(self): - dataframe, attributes, targets = self._get_iris_columns() - - self.assertEqual(list(targets.columns), ['species']) - hyperparams_class = \ - xgboost_dart.XGBoostDartClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = xgboost_dart.XGBoostDartClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.fit() - - predictions = primitive.produce(inputs=attributes).value - self.assertEqual(list(predictions.columns), ['species']) - - self.assertEqual(predictions.shape, (150, 1)) - self.assertEqual(predictions.iloc[0, 0], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(0)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(0)['custom_metadata'], 'targets') - - self._test_single_target_metadata(predictions.metadata) - - samples = primitive.sample(inputs=attributes).value - self.assertEqual(list(samples[0].columns), ['species']) - - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, (150, 1)) - self.assertEqual(samples[0].iloc[0, 0], 'Iris-setosa') - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(0)['name'], 'species') - self.assertEqual(samples[0].metadata.query_column(0)['custom_metadata'], 'targets') - - log_likelihoods = primitive.log_likelihoods(inputs=attributes, outputs=targets).value - self.assertEqual(list(log_likelihoods.columns), ['species']) - - self.assertEqual(log_likelihoods.shape, (150, 1)) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - - log_likelihood = primitive.log_likelihood(inputs=attributes, outputs=targets).value - self.assertEqual(list(log_likelihood.columns), ['species']) - - 
self.assertEqual(log_likelihood.shape, (1, 1)) - self.assertAlmostEqual(log_likelihood.iloc[0, 0], -2.414982318878174) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - - def test_single_target_continue_fit(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - xgboost_dart.XGBoostDartClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = xgboost_dart.XGBoostDartClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.fit() - # reset the training data to make continue_fit() work. - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.continue_fit() - params = primitive.get_params() - self.assertEqual(params['booster'].best_ntree_limit, - primitive.hyperparams['n_estimators'] + primitive.hyperparams['n_more_estimators']) - predictions = primitive.produce(inputs=attributes).value - - self.assertEqual(predictions.shape, (150, 1)) - self.assertEqual(predictions.iloc[0, 0], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(0)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(0)['custom_metadata'], 'targets') - - self._test_single_target_metadata(predictions.metadata) - - samples = primitive.sample(inputs=attributes).value - - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, (150, 1)) - self.assertEqual(samples[0].iloc[0, 0], 'Iris-setosa') - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(0)['name'], 'species') - self.assertEqual(samples[0].metadata.query_column(0)['custom_metadata'], 'targets') - - log_likelihoods = primitive.log_likelihoods(inputs=attributes, outputs=targets).value - - self.assertEqual(log_likelihoods.shape, (150, 1)) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - - log_likelihood = primitive.log_likelihood(inputs=attributes, outputs=targets).value - - self.assertEqual(log_likelihood.shape, (1, 1)) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - - def _test_single_target_metadata(self, predictions_metadata): - expected_metadata = [{ - 'selector': [], - 'metadata': { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 1, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 
0], - 'metadata': { - 'structural_type': 'str', - 'name': 'species', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'custom_metadata': 'targets', - }, - }] - - self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), expected_metadata) - - def test_multiple_targets(self): - dataframe, attributes, targets = self._get_iris_columns() - - targets = targets.append_columns(targets) - - self.assertEqual(list(targets.columns), ['species', 'species']) - - hyperparams_class = \ - xgboost_dart.XGBoostDartClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = xgboost_dart.XGBoostDartClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.fit() - - predictions = primitive.produce(inputs=attributes).value - self.assertEqual(list(predictions.columns), ['species', 'species']) - self.assertEqual(predictions.shape, (150, 2)) - for column_index in range(2): - self.assertEqual(predictions.iloc[0, column_index], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, column_index), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, column_index), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(column_index)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(column_index)['custom_metadata'], 'targets') - - samples = primitive.sample(inputs=attributes).value - self.assertEqual(list(samples[0].columns), ['species', 'species']) - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, (150, 2)) - for column_index in range(2): - self.assertEqual(samples[0].iloc[0, column_index], 'Iris-setosa') - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, column_index), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, column_index), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(column_index)['name'], 'species') - self.assertEqual(samples[0].metadata.query_column(column_index)['custom_metadata'], 'targets') - - log_likelihoods = primitive.log_likelihoods(inputs=attributes, outputs=targets).value - self.assertEqual(list(log_likelihoods.columns), ['species', 'species']) - - self.assertEqual(log_likelihoods.shape, (150, 2)) - for column_index in range(2): - self.assertEqual(log_likelihoods.metadata.query_column(column_index)['name'], 'species') - - log_likelihood = primitive.log_likelihood(inputs=attributes, outputs=targets).value - self.assertEqual(list(log_likelihood.columns), ['species', 'species']) - - self.assertEqual(log_likelihood.shape, (1, 2)) - for column_index in range(2): - self.assertAlmostEqual(log_likelihood.iloc[0, column_index], -2.414982318878174) - self.assertEqual(log_likelihoods.metadata.query_column(column_index)['name'], 'species') - - def 
test_multiple_targets_continue_fit(self): - dataframe, attributes, targets = self._get_iris_columns() - second_targets = targets.copy() - second_targets['species'] = targets['species'].map( - {'Iris-setosa': 't-Iris-setosa', 'Iris-versicolor': 't-Iris-versicolor', - 'Iris-virginica': 't-Iris-virginica'}) - second_targets.rename(columns={'species': 't-species'}, inplace=True) - second_targets.metadata = second_targets.metadata.update_column(0, {'name': 't-species'}) - targets = targets.append_columns(second_targets) - hyperparams_class = \ - xgboost_dart.XGBoostDartClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = xgboost_dart.XGBoostDartClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.fit() - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.continue_fit() - params = primitive.get_params() - for estimator in params['estimators']: - self.assertEqual(estimator.get_booster().best_ntree_limit, - primitive.hyperparams['n_estimators'] + primitive.hyperparams['n_more_estimators']) - - predictions = primitive.produce(inputs=attributes).value - - self.assertEqual(predictions.shape, (150, 2)) - self.assertEqual(predictions.iloc[0, 0], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(0)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(0)['custom_metadata'], 'targets') - self.assertEqual(predictions.iloc[0, 1], 't-Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(1)['name'], 't-species') - self.assertEqual(predictions.metadata.query_column(1)['custom_metadata'], 'targets') - samples = primitive.sample(inputs=attributes).value - - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, (150, 2)) - self.assertEqual(samples[0].iloc[0, 0], 'Iris-setosa') - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(0)['name'], 'species') - self.assertEqual(samples[0].metadata.query_column(0)['custom_metadata'], 'targets') - - self.assertEqual(samples[0].iloc[0, 1], 't-Iris-setosa') - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(1)['name'], 't-species') - 
self.assertEqual(samples[0].metadata.query_column(1)['custom_metadata'], 'targets') - log_likelihoods = primitive.log_likelihoods(inputs=attributes, outputs=targets).value - - self.assertEqual(log_likelihoods.shape, (150, 2)) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - self.assertEqual(log_likelihoods.metadata.query_column(1)['name'], 't-species') - - log_likelihood = primitive.log_likelihood(inputs=attributes, outputs=targets).value - - self.assertEqual(log_likelihood.shape, (1, 2)) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - self.assertEqual(log_likelihoods.metadata.query_column(1)['name'], 't-species') - - def test_semantic_types(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - xgboost_dart.XGBoostDartClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = xgboost_dart.XGBoostDartClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=dataframe, outputs=dataframe) - primitive.fit() - - predictions = primitive.produce(inputs=dataframe).value - self.assertEqual(list(predictions.columns), ['species']) - - self.assertEqual(predictions.shape, (150, 1)) - self.assertEqual(predictions.iloc[0, 0], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(0)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(0)['custom_metadata'], 'targets') - - samples = primitive.sample(inputs=dataframe).value - self.assertEqual(list(samples[0].columns), ['species']) - - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, (150, 1)) - self.assertEqual(samples[0].iloc[0, 0], 'Iris-setosa') - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(0)['name'], 'species') - self.assertEqual(samples[0].metadata.query_column(0)['custom_metadata'], 'targets') - - log_likelihoods = primitive.log_likelihoods(inputs=dataframe, outputs=dataframe).value - self.assertEqual(list(log_likelihoods.columns), ['species']) - - self.assertEqual(log_likelihoods.shape, (150, 1)) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - - log_likelihood = primitive.log_likelihood(inputs=dataframe, outputs=dataframe).value - self.assertEqual(list(log_likelihood.columns), ['species']) - - self.assertEqual(log_likelihood.shape, (1, 1)) - self.assertAlmostEqual(log_likelihood.iloc[0, 0], -2.414982318878174) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - - def test_return_append(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - xgboost_dart.XGBoostDartClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = 
xgboost_dart.XGBoostDartClassifierPrimitive(hyperparams=hyperparams_class.defaults()) - - primitive.set_training_data(inputs=dataframe, outputs=dataframe) - primitive.fit() - - predictions = primitive.produce(inputs=dataframe).value - - self.assertEqual(list(predictions.columns), [ - 'd3mIndex', - 'sepalLength', - 'sepalWidth', - 'petalLength', - 'petalWidth', - 'species', - 'species', - ]) - - self.assertEqual(predictions.shape, (150, 7)) - self.assertEqual(predictions.iloc[0, 6], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 6), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 6), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(6)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(6)['custom_metadata'], 'targets') - - self._test_return_append_metadata(predictions.metadata) - - def _test_return_append_metadata(self, predictions_metadata): - self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), [{ - 'selector': [], - 'metadata': { - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 7, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'd3mIndex', - 'structural_type': 'int', - 'semantic_types': ['http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'name': 'sepalLength', - 'structural_type': 'float', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'custom_metadata': 'attributes', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 2], - 'metadata': { - 'name': 'sepalWidth', - 'structural_type': 'float', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'custom_metadata': 'attributes', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 3], - 'metadata': { - 'name': 'petalLength', - 'structural_type': 'float', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'custom_metadata': 'attributes', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 4], - 'metadata': { - 'name': 'petalWidth', - 'structural_type': 'float', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'custom_metadata': 'attributes', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 5], - 'metadata': { - 'name': 'species', - 'structural_type': 'str', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/TrueTarget'], - 'custom_metadata': 'targets', - }, - }, { - 'selector': 
['__ALL_ELEMENTS__', 6], - 'metadata': { - 'structural_type': 'str', - 'name': 'species', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'custom_metadata': 'targets', - }, - }]) - - def test_return_new(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - xgboost_dart.XGBoostDartClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = xgboost_dart.XGBoostDartClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new'})) - - primitive.set_training_data(inputs=dataframe, outputs=dataframe) - primitive.fit() - - predictions = primitive.produce(inputs=dataframe).value - self.assertEqual(list(predictions.columns), [ - 'd3mIndex', - 'species', - ]) - self.assertEqual(predictions.shape, (150, 2)) - self.assertEqual(predictions.iloc[0, 1], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(1)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(1)['custom_metadata'], 'targets') - - self._test_return_new_metadata(predictions.metadata) - - def _test_return_new_metadata(self, predictions_metadata): - expected_metadata = [{ - 'selector': [], - 'metadata': { - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 2, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'd3mIndex', - 'structural_type': 'int', - 'semantic_types': ['http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'structural_type': 'str', - 'name': 'species', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'custom_metadata': 'targets', - }, - }] - - self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), expected_metadata) - - def test_return_replace(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - xgboost_dart.XGBoostDartClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = xgboost_dart.XGBoostDartClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'replace'})) - - 
primitive.set_training_data(inputs=dataframe, outputs=dataframe) - primitive.fit() - - predictions = primitive.produce(inputs=dataframe).value - self.assertEqual(list(predictions.columns), [ - 'd3mIndex', - 'species', - 'species', - ]) - - self.assertEqual(predictions.shape, (150, 3)) - self.assertEqual(predictions.iloc[0, 1], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(1)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(1)['custom_metadata'], 'targets') - - self._test_return_replace_metadata(predictions.metadata) - - def test_pickle_unpickle(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - xgboost_dart.XGBoostDartClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = xgboost_dart.XGBoostDartClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.fit() - - before_pickled_prediction = primitive.produce(inputs=attributes).value - pickle_object = pickle.dumps(primitive) - primitive = pickle.loads(pickle_object) - after_unpickled_prediction = primitive.produce(inputs=attributes).value - _ = pickle.dumps(primitive) - self.assertTrue(container.DataFrame.equals(before_pickled_prediction, after_unpickled_prediction)) - - def _test_return_replace_metadata(self, predictions_metadata): - self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), [{ - 'selector': [], - 'metadata': { - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 3, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'd3mIndex', - 'structural_type': 'int', - 'semantic_types': ['http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'structural_type': 'str', - 'name': 'species', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'custom_metadata': 'targets', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 2], - 'metadata': { - 'name': 'species', - 'structural_type': 'str', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 
'https://metadata.datadrivendiscovery.org/types/TrueTarget'], - 'custom_metadata': 'targets', - }, - }]) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_xgboost_gbtree.py b/common-primitives/tests/test_xgboost_gbtree.py deleted file mode 100644 index 1ec0e67..0000000 --- a/common-primitives/tests/test_xgboost_gbtree.py +++ /dev/null @@ -1,733 +0,0 @@ -import os -import pickle -import unittest - -from d3m import container, utils -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe, extract_columns_semantic_types, xgboost_gbtree, column_parser - - -class XGBoostTestCase(unittest.TestCase): - def _get_iris(self): - dataset_doc_path = os.path.abspath( - os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - hyperparams_class = \ - dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - - dataframe = primitive.produce(inputs=dataset).value - - return dataframe - - def _get_iris_columns(self): - dataframe = self._get_iris() - - # We set custom metadata on columns. - for column_index in range(1, 5): - dataframe.metadata = dataframe.metadata.update_column(column_index, {'custom_metadata': 'attributes'}) - for column_index in range(5, 6): - dataframe.metadata = dataframe.metadata.update_column(column_index, {'custom_metadata': 'targets'}) - - # We set semantic types like runtime would. - dataframe.metadata = dataframe.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 5), - 'https://metadata.datadrivendiscovery.org/types/Target') - dataframe.metadata = dataframe.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, 5), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataframe.metadata = dataframe.metadata.remove_semantic_type((metadata_base.ALL_ELEMENTS, 5), - 'https://metadata.datadrivendiscovery.org/types/Attribute') - - # Parsing. 
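-        # (Explanatory note, as in the dart test above: parsing runs before column
-        # extraction so the attribute/target selection below sees typed values
-        # rather than raw strings.)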
- hyperparams_class = \ - column_parser.ColumnParserPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = column_parser.ColumnParserPrimitive(hyperparams=hyperparams_class.defaults()) - dataframe = primitive.produce(inputs=dataframe).value - - hyperparams_class = \ - extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive.metadata.query()['primitive_code'][ - 'class_type_arguments']['Hyperparams'] - - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive( - hyperparams=hyperparams_class.defaults().replace( - {'semantic_types': ('https://metadata.datadrivendiscovery.org/types/Attribute',)})) - attributes = primitive.produce(inputs=dataframe).value - - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive( - hyperparams=hyperparams_class.defaults().replace( - {'semantic_types': ('https://metadata.datadrivendiscovery.org/types/SuggestedTarget',)})) - targets = primitive.produce(inputs=dataframe).value - - return dataframe, attributes, targets - - def test_single_target(self): - dataframe, attributes, targets = self._get_iris_columns() - - self.assertEqual(list(targets.columns), ['species']) - hyperparams_class = \ - xgboost_gbtree.XGBoostGBTreeClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = xgboost_gbtree.XGBoostGBTreeClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.fit() - - predictions = primitive.produce(inputs=attributes).value - self.assertEqual(list(predictions.columns), ['species']) - - self.assertEqual(predictions.shape, (150, 1)) - self.assertEqual(predictions.iloc[0, 0], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(0)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(0)['custom_metadata'], 'targets') - - self._test_single_target_metadata(predictions.metadata) - - samples = primitive.sample(inputs=attributes).value - self.assertEqual(list(samples[0].columns), ['species']) - - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, (150, 1)) - self.assertEqual(samples[0].iloc[0, 0], 'Iris-setosa') - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(0)['name'], 'species') - self.assertEqual(samples[0].metadata.query_column(0)['custom_metadata'], 'targets') - - log_likelihoods = primitive.log_likelihoods(inputs=attributes, outputs=targets).value - self.assertEqual(list(log_likelihoods.columns), ['species']) - - self.assertEqual(log_likelihoods.shape, (150, 1)) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - - log_likelihood = primitive.log_likelihood(inputs=attributes, outputs=targets).value - self.assertEqual(list(log_likelihood.columns), 
['species']) - - self.assertEqual(log_likelihood.shape, (1, 1)) - self.assertAlmostEqual(log_likelihood.iloc[0, 0], -3.4919378757476807) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - - def test_single_target_continue_fit(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - xgboost_gbtree.XGBoostGBTreeClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = xgboost_gbtree.XGBoostGBTreeClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.fit() - # reset the training data to make continue_fit() work. - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.continue_fit() - params = primitive.get_params() - self.assertEqual(params['booster'].best_ntree_limit, - primitive.hyperparams['n_estimators'] + primitive.hyperparams['n_more_estimators']) - predictions = primitive.produce(inputs=attributes).value - - self.assertEqual(predictions.shape, (150, 1)) - self.assertEqual(predictions.iloc[0, 0], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(0)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(0)['custom_metadata'], 'targets') - - self._test_single_target_metadata(predictions.metadata) - - samples = primitive.sample(inputs=attributes).value - - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, (150, 1)) - self.assertEqual(samples[0].iloc[0, 0], 'Iris-setosa') - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(0)['name'], 'species') - self.assertEqual(samples[0].metadata.query_column(0)['custom_metadata'], 'targets') - - log_likelihoods = primitive.log_likelihoods(inputs=attributes, outputs=targets).value - - self.assertEqual(log_likelihoods.shape, (150, 1)) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - - log_likelihood = primitive.log_likelihood(inputs=attributes, outputs=targets).value - - self.assertEqual(log_likelihood.shape, (1, 1)) - self.assertAlmostEqual(log_likelihood.iloc[0, 0], -2.4149818420410156) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - - def _test_single_target_metadata(self, predictions_metadata): - expected_metadata = [{ - 'selector': [], - 'metadata': { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': 
['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 1, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'structural_type': 'str', - 'name': 'species', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'custom_metadata': 'targets', - }, - }] - - self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), expected_metadata) - - def test_multiple_targets(self): - dataframe, attributes, targets = self._get_iris_columns() - - targets = targets.append_columns(targets) - self.assertEqual(list(targets.columns), ['species', 'species']) - - hyperparams_class = \ - xgboost_gbtree.XGBoostGBTreeClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = xgboost_gbtree.XGBoostGBTreeClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.fit() - - predictions = primitive.produce(inputs=attributes).value - self.assertEqual(list(predictions.columns), ['species', 'species']) - - self.assertEqual(predictions.shape, (150, 2)) - for column_index in range(2): - self.assertEqual(predictions.iloc[0, column_index], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, column_index), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, column_index), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(column_index)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(column_index)['custom_metadata'], 'targets') - - samples = primitive.sample(inputs=attributes).value - self.assertEqual(list(samples[0].columns), ['species', 'species']) - - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, (150, 2)) - for column_index in range(2): - self.assertEqual(samples[0].iloc[0, column_index], 'Iris-setosa') - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, column_index), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, column_index), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(column_index)['name'], 'species') - self.assertEqual(samples[0].metadata.query_column(column_index)['custom_metadata'], 'targets') - - log_likelihoods = primitive.log_likelihoods(inputs=attributes, outputs=targets).value - self.assertEqual(list(log_likelihoods.columns), ['species', 'species']) - - self.assertEqual(log_likelihoods.shape, (150, 2)) - for column_index in range(2): - self.assertEqual(log_likelihoods.metadata.query_column(column_index)['name'], 'species') - - log_likelihood = primitive.log_likelihood(inputs=attributes, outputs=targets).value - - self.assertEqual(list(log_likelihood.columns), ['species', 'species']) - self.assertEqual(log_likelihood.shape, (1, 2)) - for column_index in range(2): - self.assertAlmostEqual(log_likelihood.iloc[0, column_index], 
-3.4919378757476807) - self.assertEqual(log_likelihoods.metadata.query_column(column_index)['name'], 'species') - - feature_importances = primitive.produce_feature_importances().value - self.assertEqual(list(feature_importances), ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth']) - self.assertEqual(feature_importances.metadata.query_column(0)['name'], 'sepalLength') - self.assertEqual(feature_importances.metadata.query_column(1)['name'], 'sepalWidth') - self.assertEqual(feature_importances.metadata.query_column(2)['name'], 'petalLength') - self.assertEqual(feature_importances.metadata.query_column(3)['name'], 'petalWidth') - - self.assertEqual(feature_importances.values.tolist(), [[0.012397459708154202, - 0.03404613956809044, - 0.5992223024368286, - 0.35433411598205566, - ]]) - - def test_multiple_targets_continue_fit(self): - dataframe, attributes, targets = self._get_iris_columns() - second_targets = targets.copy() - second_targets['species'] = targets['species'].map( - {'Iris-setosa': 't-Iris-setosa', 'Iris-versicolor': 't-Iris-versicolor', - 'Iris-virginica': 't-Iris-virginica'}) - second_targets.rename(columns={'species': 't-species'}, inplace=True) - second_targets.metadata = second_targets.metadata.update_column(0, {'name': 't-species'}) - targets = targets.append_columns(second_targets) - hyperparams_class = \ - xgboost_gbtree.XGBoostGBTreeClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = xgboost_gbtree.XGBoostGBTreeClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.fit() - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.continue_fit() - params = primitive.get_params() - for estimator in params['estimators']: - self.assertEqual(estimator.get_booster().best_ntree_limit, - primitive.hyperparams['n_estimators'] + primitive.hyperparams['n_more_estimators']) - - predictions = primitive.produce(inputs=attributes).value - - - self.assertEqual(predictions.shape, (150, 2)) - self.assertEqual(predictions.iloc[0, 0], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(0)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(0)['custom_metadata'], 'targets') - self.assertEqual(predictions.iloc[0, 1], 't-Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(1)['name'], 't-species') - self.assertEqual(predictions.metadata.query_column(1)['custom_metadata'], 'targets') - samples = primitive.sample(inputs=attributes).value - - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, (150, 2)) - self.assertEqual(samples[0].iloc[0, 0], 'Iris-setosa') - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 
'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(0)['name'], 'species') - self.assertEqual(samples[0].metadata.query_column(0)['custom_metadata'], 'targets') - - self.assertEqual(samples[0].iloc[0, 1], 't-Iris-setosa') - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(1)['name'], 't-species') - self.assertEqual(samples[0].metadata.query_column(1)['custom_metadata'], 'targets') - log_likelihoods = primitive.log_likelihoods(inputs=attributes, outputs=targets).value - - self.assertEqual(log_likelihoods.shape, (150, 2)) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - self.assertEqual(log_likelihoods.metadata.query_column(1)['name'], 't-species') - - log_likelihood = primitive.log_likelihood(inputs=attributes, outputs=targets).value - - self.assertEqual(log_likelihood.shape, (1, 2)) - self.assertAlmostEqual(log_likelihood.iloc[0, 0], -2.4149818420410156) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - self.assertAlmostEqual(log_likelihood.iloc[0, 1], -2.4149818420410156) - self.assertEqual(log_likelihoods.metadata.query_column(1)['name'], 't-species') - - feature_importances = primitive.produce_feature_importances().value - - self.assertEqual(feature_importances.values.tolist(), - [[0.011062598787248135, - 0.026943154633045197, - 0.6588393449783325, - 0.3031548857688904]]) - - def test_semantic_types(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - xgboost_gbtree.XGBoostGBTreeClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = xgboost_gbtree.XGBoostGBTreeClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=dataframe, outputs=dataframe) - primitive.fit() - - predictions = primitive.produce(inputs=dataframe).value - self.assertEqual(list(predictions.columns), ['species']) - - self.assertEqual(predictions.shape, (150, 1)) - self.assertEqual(predictions.iloc[0, 0], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(0)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(0)['custom_metadata'], 'targets') - - samples = primitive.sample(inputs=dataframe).value - self.assertEqual(list(samples[0].columns), ['species']) - - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, (150, 1)) - self.assertEqual(samples[0].iloc[0, 0], 'Iris-setosa') - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - 
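
[Editor's note] One idiom recurs in every test above and below: the `Hyperparams` class is never imported directly but always reached through the primitive's own metadata, and overrides go through `defaults().replace(...)`. A minimal sketch of that flow, with `SomePrimitive` as a hypothetical stand-in for any d3m primitive class:

```python
# Sketch of the hyper-parameter override pattern used throughout these tests.
# `SomePrimitive` is hypothetical; the metadata path below is the d3m
# convention the tests rely on to reach the Hyperparams class.
hyperparams_class = SomePrimitive.metadata.query()[
    'primitive_code']['class_type_arguments']['Hyperparams']

# defaults() returns an immutable Hyperparams instance; replace() builds a
# new instance with only the listed values changed.
primitive = SomePrimitive(hyperparams=hyperparams_class.defaults().replace({
    'return_result': 'new',      # return only predictions (plus primary index)
    'add_index_columns': False,  # without re-attaching d3mIndex
}))
```
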
self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(0)['name'], 'species') - self.assertEqual(samples[0].metadata.query_column(0)['custom_metadata'], 'targets') - - log_likelihoods = primitive.log_likelihoods(inputs=dataframe, outputs=dataframe).value - self.assertEqual(list(log_likelihoods.columns), ['species']) - - self.assertEqual(log_likelihoods.shape, (150, 1)) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - - log_likelihood = primitive.log_likelihood(inputs=dataframe, outputs=dataframe).value - self.assertEqual(list(log_likelihood.columns), ['species']) - - self.assertEqual(log_likelihood.shape, (1, 1)) - self.assertAlmostEqual(log_likelihood.iloc[0, 0], -3.4919378757476807) - self.assertEqual(log_likelihoods.metadata.query_column(0)['name'], 'species') - - feature_importances = primitive.produce_feature_importances().value - self.assertEqual(list(feature_importances), ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth']) - self.assertEqual(feature_importances.metadata.query_column(0)['name'], 'sepalLength') - self.assertEqual(feature_importances.metadata.query_column(1)['name'], 'sepalWidth') - self.assertEqual(feature_importances.metadata.query_column(2)['name'], 'petalLength') - self.assertEqual(feature_importances.metadata.query_column(3)['name'], 'petalWidth') - - - self.assertEqual(feature_importances.values.tolist(), - [[0.012397459708154202, - 0.03404613956809044, - 0.5992223024368286, - 0.35433411598205566]]) - - def test_return_append(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - xgboost_gbtree.XGBoostGBTreeClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = xgboost_gbtree.XGBoostGBTreeClassifierPrimitive(hyperparams=hyperparams_class.defaults()) - - primitive.set_training_data(inputs=dataframe, outputs=dataframe) - primitive.fit() - - predictions = primitive.produce(inputs=dataframe).value - - self.assertEqual(list(predictions.columns), [ - 'd3mIndex', - 'sepalLength', - 'sepalWidth', - 'petalLength', - 'petalWidth', - 'species', - 'species', - ]) - - self.assertEqual(predictions.shape, (150, 7)) - self.assertEqual(predictions.iloc[0, 6], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 6), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 6), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(6)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(6)['custom_metadata'], 'targets') - - self._test_return_append_metadata(predictions.metadata) - - def _test_return_append_metadata(self, predictions_metadata): - self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), [{ - 'selector': [], - 'metadata': { - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 
'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 7, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'd3mIndex', - 'structural_type': 'int', - 'semantic_types': ['http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'name': 'sepalLength', - 'structural_type': 'float', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'custom_metadata': 'attributes', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 2], - 'metadata': { - 'name': 'sepalWidth', - 'structural_type': 'float', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'custom_metadata': 'attributes', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 3], - 'metadata': { - 'name': 'petalLength', - 'structural_type': 'float', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'custom_metadata': 'attributes', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 4], - 'metadata': { - 'name': 'petalWidth', - 'structural_type': 'float', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'custom_metadata': 'attributes', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 5], - 'metadata': { - 'name': 'species', - 'structural_type': 'str', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/TrueTarget'], - 'custom_metadata': 'targets', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 6], - 'metadata': { - 'structural_type': 'str', - 'name': 'species', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'custom_metadata': 'targets', - }, - }]) - - def test_return_new(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - xgboost_gbtree.XGBoostGBTreeClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = xgboost_gbtree.XGBoostGBTreeClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new'})) - - primitive.set_training_data(inputs=dataframe, outputs=dataframe) - primitive.fit() - - predictions = primitive.produce(inputs=dataframe).value - - self.assertEqual(list(predictions.columns), [ - 'd3mIndex', - 'species', - ]) - - self.assertEqual(predictions.shape, (150, 2)) - self.assertEqual(predictions.iloc[0, 1], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(1)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(1)['custom_metadata'], 'targets') - - 
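
[Editor's note] The three `return_result` tests around this point pin down the expected column layouts: `'append'` keeps all seven input columns and appends the prediction, `'new'` keeps only `d3mIndex` plus the prediction, and `'replace'` keeps `d3mIndex`, swaps the attribute columns for the prediction, and leaves the true target in place (hence the duplicated `species` column). An illustrative pure-pandas sketch of those shapes, not the primitives' actual implementation:

```python
import pandas as pd

# Illustrative-only sketch of the column layouts asserted in these tests;
# the real primitives operate on d3m containers and carry metadata alongside.
inputs = pd.DataFrame({
    'd3mIndex': [0], 'sepalLength': [5.1], 'sepalWidth': [3.5],
    'petalLength': [1.4], 'petalWidth': [0.2], 'species': ['Iris-setosa'],
})
predicted = pd.DataFrame({'species': ['Iris-setosa']})

appended = pd.concat([inputs, predicted], axis=1)           # 150 x 7 in the tests
new = pd.concat([inputs[['d3mIndex']], predicted], axis=1)  # 150 x 2
replaced = pd.concat(                                       # 150 x 3 in the tests:
    [inputs[['d3mIndex']], predicted, inputs[['species']]], # index, prediction,
    axis=1)                                                 # true target
```
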
self._test_return_new_metadata(predictions.metadata) - - def _test_return_new_metadata(self, predictions_metadata): - expected_metadata = [{ - 'selector': [], - 'metadata': { - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 2, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'd3mIndex', - 'structural_type': 'int', - 'semantic_types': ['http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'structural_type': 'str', - 'name': 'species', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'custom_metadata': 'targets', - }, - }] - - self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), expected_metadata) - - def test_return_replace(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - xgboost_gbtree.XGBoostGBTreeClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = xgboost_gbtree.XGBoostGBTreeClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'replace'})) - - primitive.set_training_data(inputs=dataframe, outputs=dataframe) - primitive.fit() - - predictions = primitive.produce(inputs=dataframe).value - - self.assertEqual(list(predictions.columns), [ - 'd3mIndex', - 'species', - 'species', - ]) - - self.assertEqual(predictions.shape, (150, 3)) - self.assertEqual(predictions.iloc[0, 1], 'Iris-setosa') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(1)['name'], 'species') - self.assertEqual(predictions.metadata.query_column(1)['custom_metadata'], 'targets') - - self._test_return_replace_metadata(predictions.metadata) - - def test_pickle_unpickle(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - xgboost_gbtree.XGBoostGBTreeClassifierPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = xgboost_gbtree.XGBoostGBTreeClassifierPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.fit() - - before_pickled_prediction = primitive.produce(inputs=attributes).value - pickle_object = pickle.dumps(primitive) - primitive = pickle.loads(pickle_object) - after_unpickled_prediction = primitive.produce(inputs=attributes).value - # try 
to pickle again to see if we load it properly - _ = pickle.dumps(primitive) - self.assertTrue(container.DataFrame.equals(before_pickled_prediction, after_unpickled_prediction)) - - def _test_return_replace_metadata(self, predictions_metadata): - self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), [{ - 'selector': [], - 'metadata': { - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 3, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'd3mIndex', - 'structural_type': 'int', - 'semantic_types': ['http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'structural_type': 'str', - 'name': 'species', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'custom_metadata': 'targets', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 2], - 'metadata': { - 'name': 'species', - 'structural_type': 'str', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/TrueTarget'], - 'custom_metadata': 'targets', - }, - }]) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/test_xgboost_regressor.py b/common-primitives/tests/test_xgboost_regressor.py deleted file mode 100644 index d513cc1..0000000 --- a/common-primitives/tests/test_xgboost_regressor.py +++ /dev/null @@ -1,617 +0,0 @@ -import os -import pickle -import unittest - -from sklearn.metrics import mean_squared_error - -from d3m import container, utils -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe, extract_columns_semantic_types, xgboost_regressor, column_parser - - -class XGBoostRegressorTestCase(unittest.TestCase): - def _get_iris(self): - dataset_doc_path = os.path.abspath( - os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - - hyperparams_class = \ - dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=hyperparams_class.defaults()) - - dataframe = primitive.produce(inputs=dataset).value - - return dataframe - - def _get_iris_columns(self): - dataframe = self._get_iris() - col_index_list = list(range(len(dataframe.columns))) - _, target = col_index_list.pop(0), col_index_list.pop(3) - original_target_col = 5 - # We set custom metadata on columns. 
- for column_index in col_index_list: - dataframe.metadata = dataframe.metadata.update_column(column_index, {'custom_metadata': 'attributes'}) - dataframe.metadata = dataframe.metadata.update_column(target, {'custom_metadata': 'targets'}) - dataframe.metadata = dataframe.metadata.remove_semantic_type((metadata_base.ALL_ELEMENTS, target), - 'https://metadata.datadrivendiscovery.org/types/Attribute') - dataframe.metadata = dataframe.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, original_target_col), - 'https://metadata.datadrivendiscovery.org/types/Attribute') - # We set semantic types like runtime would. - dataframe.metadata = dataframe.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, target), - 'https://metadata.datadrivendiscovery.org/types/Target') - dataframe.metadata = dataframe.metadata.add_semantic_type((metadata_base.ALL_ELEMENTS, target), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget') - dataframe.metadata = dataframe.metadata.remove_semantic_type((metadata_base.ALL_ELEMENTS, target), 'https://metadata.datadrivendiscovery.org/types/Attribute') - - # Parsing. - hyperparams_class = \ - column_parser.ColumnParserPrimitive.metadata.query()['primitive_code']['class_type_arguments'][ - 'Hyperparams'] - primitive = column_parser.ColumnParserPrimitive(hyperparams=hyperparams_class.defaults()) - dataframe = primitive.produce(inputs=dataframe).value - - hyperparams_class = \ - extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive.metadata.query()['primitive_code'][ - 'class_type_arguments']['Hyperparams'] - - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive( - hyperparams=hyperparams_class.defaults().replace( - {'semantic_types': ('https://metadata.datadrivendiscovery.org/types/Attribute',)})) - attributes = primitive.produce(inputs=dataframe).value - - primitive = extract_columns_semantic_types.ExtractColumnsBySemanticTypesPrimitive( - hyperparams=hyperparams_class.defaults().replace( - {'semantic_types': ('https://metadata.datadrivendiscovery.org/types/TrueTarget',)})) - targets = primitive.produce(inputs=dataframe).value - - return dataframe, attributes, targets - - def test_single_target(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - xgboost_regressor.XGBoostGBTreeRegressorPrimitive.metadata.query()['primitive_code'][ - 'class_type_arguments']['Hyperparams'] - primitive = xgboost_regressor.XGBoostGBTreeRegressorPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.fit() - - predictions = primitive.produce(inputs=attributes).value - mse = mean_squared_error(targets, predictions) - self.assertLessEqual(mse, 0.01) - self.assertEqual(predictions.shape, (150, 1)) - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(0)['name'], 'petalWidth') - self.assertEqual(predictions.metadata.query_column(0)['custom_metadata'], 'targets') - - self._test_single_target_metadata(predictions.metadata) - - samples = primitive.sample(inputs=attributes).value - - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, 
(150, 1)) - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(0)['name'], 'petalWidth') - self.assertEqual(samples[0].metadata.query_column(0)['custom_metadata'], 'targets') - - def test_single_target_continue(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - xgboost_regressor.XGBoostGBTreeRegressorPrimitive.metadata.query()['primitive_code'][ - 'class_type_arguments'][ - 'Hyperparams'] - primitive = xgboost_regressor.XGBoostGBTreeRegressorPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.fit() - # reset the training data to make continue_fit() work. - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.continue_fit() - params = primitive.get_params() - self.assertEqual(params['booster'].best_ntree_limit, - primitive.hyperparams['n_estimators'] + primitive.hyperparams['n_more_estimators']) - predictions = primitive.produce(inputs=attributes).value - mse = mean_squared_error(targets, predictions) - self.assertLessEqual(mse, 0.01) - self.assertEqual(predictions.shape, (150, 1)) - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(0)['name'], 'petalWidth') - self.assertEqual(predictions.metadata.query_column(0)['custom_metadata'], 'targets') - - self._test_single_target_metadata(predictions.metadata) - - samples = primitive.sample(inputs=attributes).value - - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, (150, 1)) - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(0)['name'], 'petalWidth') - self.assertEqual(samples[0].metadata.query_column(0)['custom_metadata'], 'targets') - - def _test_single_target_metadata(self, predictions_metadata): - expected_metadata = [{ - 'selector': [], - 'metadata': { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 1, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'structural_type': 'float', - 'name': 'petalWidth', - 'semantic_types': ['http://schema.org/Float', - 
'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'custom_metadata': 'targets', - }, - }] - - self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), expected_metadata) - - def test_multiple_targets(self): - dataframe, attributes, targets = self._get_iris_columns() - - targets = targets.append_columns(targets) - - hyperparams_class = \ - xgboost_regressor.XGBoostGBTreeRegressorPrimitive.metadata.query()['primitive_code'][ - 'class_type_arguments']['Hyperparams'] - primitive = xgboost_regressor.XGBoostGBTreeRegressorPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.fit() - - predictions = primitive.produce(inputs=attributes).value - mse = mean_squared_error(targets, predictions) - self.assertLessEqual(mse, 0.01) - - self.assertEqual(predictions.shape, (150, 2)) - for column_index in range(2): - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, column_index), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, column_index), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(column_index)['name'], 'petalWidth') - self.assertEqual(predictions.metadata.query_column(column_index)['custom_metadata'], 'targets') - - samples = primitive.sample(inputs=attributes).value - - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, (150, 2)) - for column_index in range(2): - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, column_index), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, column_index), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(column_index)['name'], 'petalWidth') - self.assertEqual(samples[0].metadata.query_column(column_index)['custom_metadata'], 'targets') - - feature_importances = primitive.produce_feature_importances().value - - self.assertEqual(feature_importances.values.tolist(), - [[0.0049971588887274265, - 0.006304567214101553, - 0.27505698800086975, - 0.7136412858963013]]) - - def test_multiple_targets_continue(self): - dataframe, attributes, targets = self._get_iris_columns() - second_targets = targets.copy() - second_targets.rename(columns={'petalWidth': 't-petalWidth'}, inplace=True) - second_targets.metadata = second_targets.metadata.update_column(0, {'name': 't-petalWidth'}) - targets = targets.append_columns(second_targets) - - hyperparams_class = \ - xgboost_regressor.XGBoostGBTreeRegressorPrimitive.metadata.query()['primitive_code'][ - 'class_type_arguments']['Hyperparams'] - primitive = xgboost_regressor.XGBoostGBTreeRegressorPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.fit() - # Set training data again to make continue_fit work - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.continue_fit() - params = primitive.get_params() - for estimator in params['estimators']: - 
self.assertEqual(estimator.get_booster().best_ntree_limit, - primitive.hyperparams['n_estimators'] + primitive.hyperparams['n_more_estimators']) - - predictions = primitive.produce(inputs=attributes).value - mse = mean_squared_error(targets, predictions) - self.assertLessEqual(mse, 0.01) - self.assertEqual(predictions.shape, (150, 2)) - - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(0)['name'], 'petalWidth') - self.assertEqual(predictions.metadata.query_column(0)['custom_metadata'], 'targets') - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(1)['name'], 't-petalWidth') - self.assertEqual(predictions.metadata.query_column(1)['custom_metadata'], 'targets') - - samples = primitive.sample(inputs=attributes).value - - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, (150, 2)) - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(0)['name'], 'petalWidth') - self.assertEqual(samples[0].metadata.query_column(0)['custom_metadata'], 'targets') - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(1)['name'], 't-petalWidth') - self.assertEqual(samples[0].metadata.query_column(1)['custom_metadata'], 'targets') - - feature_importances = primitive.produce_feature_importances().value - - self.assertEqual(feature_importances.values.tolist(), - [[0.003233343129977584, - 0.003926052246242762, - 0.19553671777248383, - 0.7973038554191589]]) - - def test_semantic_types(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - xgboost_regressor.XGBoostGBTreeRegressorPrimitive.metadata.query()['primitive_code'][ - 'class_type_arguments']['Hyperparams'] - primitive = xgboost_regressor.XGBoostGBTreeRegressorPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=dataframe, outputs=dataframe) - primitive.fit() - - predictions = primitive.produce(inputs=dataframe).value - - self.assertEqual(predictions.shape, (150, 1)) - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) 
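
[Editor's note] The `continue_fit` tests here and in the classifier files share the same warm-start contract: `fit()` trains `n_estimators` boosting rounds, a second `set_training_data()` call re-arms the primitive, and `continue_fit()` adds `n_more_estimators` rounds on top of the existing booster. A condensed sketch of that flow, assuming `primitive`, `attributes`, and `targets` as constructed above:

```python
# Warm-start flow exercised by the continue_fit tests.
primitive.set_training_data(inputs=attributes, outputs=targets)
primitive.fit()            # trains hyperparams['n_estimators'] boosting rounds

# set_training_data() must be called again between fit() and continue_fit();
# it resets the primitive's training state so continuation is permitted.
primitive.set_training_data(inputs=attributes, outputs=targets)
primitive.continue_fit()   # adds hyperparams['n_more_estimators'] rounds

expected_rounds = (primitive.hyperparams['n_estimators']
                   + primitive.hyperparams['n_more_estimators'])
params = primitive.get_params()
# Single-target models expose one booster under params['booster'];
# multi-target models keep a list under params['estimators'] instead.
assert params['booster'].best_ntree_limit == expected_rounds
```
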
- self.assertEqual(predictions.metadata.query_column(0)['name'], 'petalWidth') - self.assertEqual(predictions.metadata.query_column(0)['custom_metadata'], 'targets') - - samples = primitive.sample(inputs=attributes).value - - self.assertEqual(len(samples), 1) - self.assertEqual(samples[0].shape, (150, 1)) - self.assertTrue(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(samples[0].metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 0), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(samples[0].metadata.query_column(0)['name'], 'petalWidth') - self.assertEqual(samples[0].metadata.query_column(0)['custom_metadata'], 'targets') - - feature_importances = primitive.produce_feature_importances().value - - self.assertEqual(feature_importances.values.tolist(), - [[0.0049971588887274265, - 0.006304567214101553, - 0.27505698800086975, - 0.7136412858963013]]) - - def test_return_append(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - xgboost_regressor.XGBoostGBTreeRegressorPrimitive.metadata.query()['primitive_code'][ - 'class_type_arguments']['Hyperparams'] - primitive = xgboost_regressor.XGBoostGBTreeRegressorPrimitive(hyperparams=hyperparams_class.defaults()) - - primitive.set_training_data(inputs=dataframe, outputs=dataframe) - primitive.fit() - - predictions = primitive.produce(inputs=dataframe).value - - self.assertEqual(predictions.shape, (150, 7)) - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 6), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 6), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(6)['name'], 'petalWidth') - self.assertEqual(predictions.metadata.query_column(6)['custom_metadata'], 'targets') - - self._test_return_append_metadata(predictions.metadata) - - def _test_return_append_metadata(self, predictions_metadata): - self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), [{ - 'selector': [], - 'metadata': { - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 7, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'd3mIndex', - 'structural_type': 'int', - 'semantic_types': ['http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'name': 'sepalLength', - 'structural_type': 'float', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'custom_metadata': 'attributes', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 2], - 'metadata': { - 'name': 'sepalWidth', - 'structural_type': 'float', - 'semantic_types': ['http://schema.org/Float', - 
'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'custom_metadata': 'attributes', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 3], - 'metadata': { - 'name': 'petalLength', - 'structural_type': 'float', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute'], - 'custom_metadata': 'attributes', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 4], - 'metadata': { - 'name': 'petalWidth', - 'structural_type': 'float', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/TrueTarget'], - 'custom_metadata': 'targets', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 5], - 'metadata': { - 'name': 'species', - 'structural_type': 'int', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Attribute', ], - 'custom_metadata': 'attributes', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 6], - 'metadata': { - 'structural_type': 'float', - 'name': 'petalWidth', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'custom_metadata': 'targets', - }, - }]) - - def test_return_new(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - xgboost_regressor.XGBoostGBTreeRegressorPrimitive.metadata.query()['primitive_code'][ - 'class_type_arguments']['Hyperparams'] - primitive = xgboost_regressor.XGBoostGBTreeRegressorPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new'})) - - primitive.set_training_data(inputs=dataframe, outputs=dataframe) - primitive.fit() - - predictions = primitive.produce(inputs=dataframe).value - - self.assertEqual(predictions.shape, (150, 2)) - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(1)['name'], 'petalWidth') - self.assertEqual(predictions.metadata.query_column(1)['custom_metadata'], 'targets') - - self._test_return_new_metadata(predictions.metadata) - - def _test_return_new_metadata(self, predictions_metadata): - expected_metadata = [{ - 'selector': [], - 'metadata': { - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 2, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'd3mIndex', - 'structural_type': 'int', - 'semantic_types': ['http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'structural_type': 'float', - 'name': 
'petalWidth', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'custom_metadata': 'targets', - }, - }] - - self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), expected_metadata) - - def test_return_replace(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - xgboost_regressor.XGBoostGBTreeRegressorPrimitive.metadata.query()['primitive_code'][ - 'class_type_arguments']['Hyperparams'] - primitive = xgboost_regressor.XGBoostGBTreeRegressorPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'replace'})) - - primitive.set_training_data(inputs=dataframe, outputs=dataframe) - primitive.fit() - - predictions = primitive.produce(inputs=dataframe).value - - self.assertEqual(predictions.shape, (150, 3)) - self.assertTrue(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/PredictedTarget')) - self.assertFalse(predictions.metadata.has_semantic_type((metadata_base.ALL_ELEMENTS, 1), - 'https://metadata.datadrivendiscovery.org/types/TrueTarget')) - self.assertEqual(predictions.metadata.query_column(1)['name'], 'petalWidth') - self.assertEqual(predictions.metadata.query_column(1)['custom_metadata'], 'targets') - - self._test_return_replace_metadata(predictions.metadata) - - def test_pickle_unpickle(self): - dataframe, attributes, targets = self._get_iris_columns() - - hyperparams_class = \ - xgboost_regressor.XGBoostGBTreeRegressorPrimitive.metadata.query()['primitive_code'][ - 'class_type_arguments'][ - 'Hyperparams'] - primitive = xgboost_regressor.XGBoostGBTreeRegressorPrimitive( - hyperparams=hyperparams_class.defaults().replace({'return_result': 'new', 'add_index_columns': False})) - - primitive.set_training_data(inputs=attributes, outputs=targets) - primitive.fit() - - before_pickled_prediction = primitive.produce(inputs=attributes).value - pickle_object = pickle.dumps(primitive) - primitive = pickle.loads(pickle_object) - after_unpickled_prediction = primitive.produce(inputs=attributes).value - self.assertTrue(container.DataFrame.equals(before_pickled_prediction, after_unpickled_prediction)) - - def _test_return_replace_metadata(self, predictions_metadata): - self.assertEqual(utils.to_json_structure(predictions_metadata.to_internal_simple_structure()), [{ - 'selector': [], - 'metadata': { - 'structural_type': 'd3m.container.pandas.DataFrame', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/Table'], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - }, - 'schema': 'https://metadata.datadrivendiscovery.org/schemas/v0/container.json', - }, - }, { - 'selector': ['__ALL_ELEMENTS__'], - 'metadata': { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 3, - }, - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 0], - 'metadata': { - 'name': 'd3mIndex', - 'structural_type': 'int', - 'semantic_types': ['http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey'], - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 1], - 'metadata': { - 'structural_type': 'float', - 'name': 'petalWidth', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Target', - 
'https://metadata.datadrivendiscovery.org/types/PredictedTarget'], - 'custom_metadata': 'targets', - }, - }, { - 'selector': ['__ALL_ELEMENTS__', 2], - 'metadata': { - 'name': 'petalWidth', - 'structural_type': 'float', - 'semantic_types': ['http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Target', - 'https://metadata.datadrivendiscovery.org/types/TrueTarget'], - 'custom_metadata': 'targets', - }, - }]) - - -if __name__ == '__main__': - unittest.main() diff --git a/common-primitives/tests/utils.py b/common-primitives/tests/utils.py deleted file mode 100644 index 18dc51c..0000000 --- a/common-primitives/tests/utils.py +++ /dev/null @@ -1,112 +0,0 @@ -import json -import os - -from d3m import utils, container -from d3m.metadata import base as metadata_base - -from common_primitives import dataset_to_dataframe - - -def convert_metadata(metadata): - return json.loads(json.dumps(metadata, cls=utils.JsonEncoder)) - - -def load_iris_metadata(): - dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data', 'datasets', 'iris_dataset_1', 'datasetDoc.json')) - dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path)) - return dataset - - -def test_iris_metadata(test_obj, metadata, structural_type, rows_structural_type=None): - test_obj.maxDiff = None - - test_obj.assertEqual(convert_metadata(metadata.query(())), { - 'schema': metadata_base.CONTAINER_SCHEMA_VERSION, - 'structural_type': structural_type, - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/Table', - ], - 'dimension': { - 'name': 'rows', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularRow'], - 'length': 150, - } - }) - - if rows_structural_type is None: - test_obj.assertEqual(convert_metadata(metadata.query((metadata_base.ALL_ELEMENTS,))), { - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 6, - } - }) - else: - test_obj.assertEqual(convert_metadata(metadata.query((metadata_base.ALL_ELEMENTS,))), { - 'structural_type': rows_structural_type, - 'dimension': { - 'name': 'columns', - 'semantic_types': ['https://metadata.datadrivendiscovery.org/types/TabularColumn'], - 'length': 6, - } - }) - - test_obj.assertEqual(convert_metadata(metadata.query((metadata_base.ALL_ELEMENTS, 0))), { - 'name': 'd3mIndex', - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Integer', - 'https://metadata.datadrivendiscovery.org/types/PrimaryKey', - ], - }) - - for i in range(1, 5): - test_obj.assertEqual(convert_metadata(metadata.query((metadata_base.ALL_ELEMENTS, i))), { - 'name': ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth'][i - 1], - 'structural_type': 'str', - 'semantic_types': [ - 'http://schema.org/Float', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }, i) - - test_obj.assertEqual(convert_metadata(metadata.query((metadata_base.ALL_ELEMENTS, 5))), { - 'name': 'species', - 'structural_type': 'str', - 'semantic_types': [ - 'https://metadata.datadrivendiscovery.org/types/CategoricalData', - 'https://metadata.datadrivendiscovery.org/types/SuggestedTarget', - 'https://metadata.datadrivendiscovery.org/types/Attribute', - ], - }) - - -def convert_through_json(data): - return json.loads(json.dumps(data, cls=utils.JsonEncoder)) - - -def normalize_semantic_types(data): - if isinstance(data, dict): - if 'semantic_types' in data: - # We sort them so that it is easier to compare them. 
- data['semantic_types'] = sorted(data['semantic_types']) - - return {key: normalize_semantic_types(value) for key, value in data.items()} - - return data - - -def effective_metadata(metadata): - output = metadata.to_json_structure() - - for entry in output: - entry['metadata'] = normalize_semantic_types(entry['metadata']) - - return output - - -def get_dataframe(dataset): - dataset_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams() - dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataset_hyperparams_class.defaults()) - dataframe = dataframe_primitive.produce(inputs=dataset).value - return dataframe diff --git a/entry_points.ini b/entry_points.ini index 6806df3..690abb7 100644 --- a/entry_points.ini +++ b/entry_points.ini @@ -1,79 +1,79 @@ [d3m.primitives] -tods.data_processing.dataset_to_dataframe = data_processing.DatasetToDataframe:DatasetToDataFramePrimitive -tods.data_processing.time_interval_transform = data_processing.TimeIntervalTransform:TimeIntervalTransform -tods.data_processing.categorical_to_binary = data_processing.CategoricalToBinary:CategoricalToBinary -tods.data_processing.column_filter = data_processing.ColumnFilter:ColumnFilter -tods.data_processing.timestamp_validation = data_processing.TimeStampValidation:TimeStampValidationPrimitive -tods.data_processing.duplication_validation = data_processing.DuplicationValidation:DuplicationValidation -tods.data_processing.continuity_validation = data_processing.ContinuityValidation:ContinuityValidation +tods.data_processing.dataset_to_dataframe = tods.data_processing.DatasetToDataframe:DatasetToDataFramePrimitive +tods.data_processing.time_interval_transform = tods.data_processing.TimeIntervalTransform:TimeIntervalTransform +tods.data_processing.categorical_to_binary = tods.data_processing.CategoricalToBinary:CategoricalToBinary +tods.data_processing.column_filter = tods.data_processing.ColumnFilter:ColumnFilter +tods.data_processing.timestamp_validation = tods.data_processing.TimeStampValidation:TimeStampValidationPrimitive +tods.data_processing.duplication_validation = tods.data_processing.DuplicationValidation:DuplicationValidation +tods.data_processing.continuity_validation = tods.data_processing.ContinuityValidation:ContinuityValidation -tods.timeseries_processing.transformation.axiswise_scaler=timeseries_processing.SKAxiswiseScaler:SKAxiswiseScaler -tods.timeseries_processing.transformation.standard_scaler=timeseries_processing.SKStandardScaler:SKStandardScaler -tods.timeseries_processing.transformation.power_transformer=timeseries_processing.SKPowerTransformer:SKPowerTransformer -tods.timeseries_processing.transformation.quantile_transformer=timeseries_processing.SKQuantileTransformer:SKQuantileTransformer -tods.timeseries_processing.transformation.moving_average_transform = timeseries_processing.MovingAverageTransform:MovingAverageTransform -tods.timeseries_processing.transformation.simple_exponential_smoothing = timeseries_processing.SimpleExponentialSmoothing:SimpleExponentialSmoothing -tods.timeseries_processing.transformation.holt_smoothing = timeseries_processing.HoltSmoothing:HoltSmoothing -tods.timeseries_processing.transformation.holt_winters_exponential_smoothing= timeseries_processing.HoltWintersExponentialSmoothing:HoltWintersExponentialSmoothing -tods.timeseries_processing.decomposition.time_series_seasonality_trend_decomposition = 
timeseries_processing.TimeSeriesSeasonalityTrendDecomposition:TimeSeriesSeasonalityTrendDecompositionPrimitive +tods.timeseries_processing.transformation.axiswise_scaler = tods.timeseries_processing.SKAxiswiseScaler:SKAxiswiseScaler +tods.timeseries_processing.transformation.standard_scaler = tods.timeseries_processing.SKStandardScaler:SKStandardScaler +tods.timeseries_processing.transformation.power_transformer = tods.timeseries_processing.SKPowerTransformer:SKPowerTransformer +tods.timeseries_processing.transformation.quantile_transformer = tods.timeseries_processing.SKQuantileTransformer:SKQuantileTransformer +tods.timeseries_processing.transformation.moving_average_transform = tods.timeseries_processing.MovingAverageTransform:MovingAverageTransform +tods.timeseries_processing.transformation.simple_exponential_smoothing = tods.timeseries_processing.SimpleExponentialSmoothing:SimpleExponentialSmoothing +tods.timeseries_processing.transformation.holt_smoothing = tods.timeseries_processing.HoltSmoothing:HoltSmoothing +tods.timeseries_processing.transformation.holt_winters_exponential_smoothing = tods.timeseries_processing.HoltWintersExponentialSmoothing:HoltWintersExponentialSmoothing +tods.timeseries_processing.decomposition.time_series_seasonality_trend_decomposition = tods.timeseries_processing.TimeSeriesSeasonalityTrendDecomposition:TimeSeriesSeasonalityTrendDecompositionPrimitive -tods.feature_analysis.auto_correlation = feature_analysis.AutoCorrelation:AutoCorrelation -tods.feature_analysis.statistical_mean = feature_analysis.StatisticalMean:StatisticalMeanPrimitive -tods.feature_analysis.statistical_median = feature_analysis.StatisticalMedian:StatisticalMedianPrimitive -tods.feature_analysis.statistical_g_mean = feature_analysis.StatisticalGmean:StatisticalGmeanPrimitive -tods.feature_analysis.statistical_abs_energy = feature_analysis.StatisticalAbsEnergy:StatisticalAbsEnergyPrimitive -tods.feature_analysis.statistical_abs_sum = feature_analysis.StatisticalAbsSum:StatisticalAbsSumPrimitive -tods.feature_analysis.statistical_h_mean = feature_analysis.StatisticalHmean:StatisticalHmeanPrimitive -tods.feature_analysis.statistical_maximum = feature_analysis.StatisticalMaximum:StatisticalMaximumPrimitive -tods.feature_analysis.statistical_minimum = feature_analysis.StatisticalMinimum:StatisticalMinimumPrimitive -tods.feature_analysis.statistical_mean_abs = feature_analysis.StatisticalMeanAbs:StatisticalMeanAbsPrimitive -tods.feature_analysis.statistical_mean_abs_temporal_derivative = feature_analysis.StatisticalMeanAbsTemporalDerivative:StatisticalMeanAbsTemporalDerivativePrimitive -tods.feature_analysis.statistical_mean_temporal_derivative = feature_analysis.StatisticalMeanTemporalDerivative:StatisticalMeanTemporalDerivativePrimitive -tods.feature_analysis.statistical_median_abs_deviation = feature_analysis.StatisticalMedianAbsoluteDeviation:StatisticalMedianAbsoluteDeviationPrimitive -tods.feature_analysis.statistical_kurtosis = feature_analysis.StatisticalKurtosis:StatisticalKurtosisPrimitive -tods.feature_analysis.statistical_skew = feature_analysis.StatisticalSkew:StatisticalSkewPrimitive -tods.feature_analysis.statistical_std = feature_analysis.StatisticalStd:StatisticalStdPrimitive -tods.feature_analysis.statistical_var = feature_analysis.StatisticalVar:StatisticalVarPrimitive -tods.feature_analysis.statistical_variation = feature_analysis.StatisticalVariation:StatisticalVariationPrimitive -tods.feature_analysis.statistical_vec_sum = 
feature_analysis.StatisticalVecSum:StatisticalVecSumPrimitive -tods.feature_analysis.statistical_willison_amplitude = feature_analysis.StatisticalWillisonAmplitude:StatisticalWillisonAmplitudePrimitive -tods.feature_analysis.statistical_zero_crossing = feature_analysis.StatisticalZeroCrossing:StatisticalZeroCrossingPrimitive -tods.feature_analysis.spectral_residual_transform = feature_analysis.SpectralResidualTransform:SpectralResidualTransformPrimitive -tods.feature_analysis.fast_fourier_transform = feature_analysis.FastFourierTransform:FastFourierTransform -tods.feature_analysis.discrete_cosine_transform = feature_analysis.DiscreteCosineTransform:DiscreteCosineTransform -tods.feature_analysis.non_negative_matrix_factorization = feature_analysis.NonNegativeMatrixFactorization:NonNegativeMatrixFactorization -tods.feature_analysis.bk_filter = feature_analysis.BKFilter:BKFilter -tods.feature_analysis.hp_filter = feature_analysis.HPFilter:HPFilter -tods.feature_analysis.truncated_svd = feature_analysis.SKTruncatedSVD:SKTruncatedSVD -tods.feature_analysis.wavelet_transform = feature_analysis.WaveletTransform:WaveletTransformer -tods.feature_analysis.trmf = feature_analysis.TRMF:TRMF +tods.feature_analysis.auto_correlation = tods.feature_analysis.AutoCorrelation:AutoCorrelation +tods.feature_analysis.statistical_mean = tods.feature_analysis.StatisticalMean:StatisticalMeanPrimitive +tods.feature_analysis.statistical_median = tods.feature_analysis.StatisticalMedian:StatisticalMedianPrimitive +tods.feature_analysis.statistical_g_mean = tods.feature_analysis.StatisticalGmean:StatisticalGmeanPrimitive +tods.feature_analysis.statistical_abs_energy = tods.feature_analysis.StatisticalAbsEnergy:StatisticalAbsEnergyPrimitive +tods.feature_analysis.statistical_abs_sum = tods.feature_analysis.StatisticalAbsSum:StatisticalAbsSumPrimitive +tods.feature_analysis.statistical_h_mean = tods.feature_analysis.StatisticalHmean:StatisticalHmeanPrimitive +tods.feature_analysis.statistical_maximum = tods.feature_analysis.StatisticalMaximum:StatisticalMaximumPrimitive +tods.feature_analysis.statistical_minimum = tods.feature_analysis.StatisticalMinimum:StatisticalMinimumPrimitive +tods.feature_analysis.statistical_mean_abs = tods.feature_analysis.StatisticalMeanAbs:StatisticalMeanAbsPrimitive +tods.feature_analysis.statistical_mean_abs_temporal_derivative = tods.feature_analysis.StatisticalMeanAbsTemporalDerivative:StatisticalMeanAbsTemporalDerivativePrimitive +tods.feature_analysis.statistical_mean_temporal_derivative = tods.feature_analysis.StatisticalMeanTemporalDerivative:StatisticalMeanTemporalDerivativePrimitive +tods.feature_analysis.statistical_median_abs_deviation = tods.feature_analysis.StatisticalMedianAbsoluteDeviation:StatisticalMedianAbsoluteDeviationPrimitive +tods.feature_analysis.statistical_kurtosis = tods.feature_analysis.StatisticalKurtosis:StatisticalKurtosisPrimitive +tods.feature_analysis.statistical_skew = tods.feature_analysis.StatisticalSkew:StatisticalSkewPrimitive +tods.feature_analysis.statistical_std = tods.feature_analysis.StatisticalStd:StatisticalStdPrimitive +tods.feature_analysis.statistical_var = tods.feature_analysis.StatisticalVar:StatisticalVarPrimitive +tods.feature_analysis.statistical_variation = tods.feature_analysis.StatisticalVariation:StatisticalVariationPrimitive +tods.feature_analysis.statistical_vec_sum = tods.feature_analysis.StatisticalVecSum:StatisticalVecSumPrimitive +tods.feature_analysis.statistical_willison_amplitude = 
tods.feature_analysis.StatisticalWillisonAmplitude:StatisticalWillisonAmplitudePrimitive +tods.feature_analysis.statistical_zero_crossing = tods.feature_analysis.StatisticalZeroCrossing:StatisticalZeroCrossingPrimitive +tods.feature_analysis.spectral_residual_transform = tods.feature_analysis.SpectralResidualTransform:SpectralResidualTransformPrimitive +tods.feature_analysis.fast_fourier_transform = tods.feature_analysis.FastFourierTransform:FastFourierTransform +tods.feature_analysis.discrete_cosine_transform = tods.feature_analysis.DiscreteCosineTransform:DiscreteCosineTransform +tods.feature_analysis.non_negative_matrix_factorization = tods.feature_analysis.NonNegativeMatrixFactorization:NonNegativeMatrixFactorization +tods.feature_analysis.bk_filter = tods.feature_analysis.BKFilter:BKFilter +tods.feature_analysis.hp_filter = tods.feature_analysis.HPFilter:HPFilter +tods.feature_analysis.truncated_svd = tods.feature_analysis.SKTruncatedSVD:SKTruncatedSVD +tods.feature_analysis.wavelet_transform = tods.feature_analysis.WaveletTransform:WaveletTransformer +tods.feature_analysis.trmf = tods.feature_analysis.TRMF:TRMF -tods.detection_algorithm.pyod_ae = detection_algorithm.PyodAE:AutoEncoder -tods.detection_algorithm.pyod_vae = detection_algorithm.PyodVAE:VariationalAutoEncoder -tods.detection_algorithm.pyod_cof = detection_algorithm.PyodCOF:PyodCOF -tods.detection_algorithm.pyod_sod = detection_algorithm.PyodSOD:SODPrimitive -tods.detection_algorithm.pyod_abod = detection_algorithm.PyodABOD:ABODPrimitive -tods.detection_algorithm.pyod_hbos = detection_algorithm.PyodHBOS:HBOSPrimitive -tods.detection_algorithm.pyod_iforest = detection_algorithm.PyodIsolationForest:IsolationForest -tods.detection_algorithm.pyod_lof = detection_algorithm.PyodLOF:LOFPrimitive -tods.detection_algorithm.pyod_autoencoder = detection_algorithm.PyodAutoEncoder:AutoEncoderPrimitive -tods.detection_algorithm.pyod_knn = detection_algorithm.PyodKNN:KNNPrimitive -tods.detection_algorithm.pyod_ocsvm = detection_algorithm.PyodOCSVM:OCSVMPrimitive -tods.detection_algorithm.pyod_loda = detection_algorithm.PyodLODA:LODAPrimitive -tods.detection_algorithm.pyod_cblof = detection_algorithm.PyodCBLOF:CBLOFPrimitive -tods.detection_algorithm.pyod_sogaal = detection_algorithm.PyodSoGaal:So_GaalPrimitive -tods.detection_algorithm.pyod_mogaal = detection_algorithm.PyodMoGaal:Mo_GaalPrimitive +tods.detection_algorithm.pyod_ae = tods.detection_algorithm.PyodAE:AutoEncoder +tods.detection_algorithm.pyod_vae = tods.detection_algorithm.PyodVAE:VariationalAutoEncoder +tods.detection_algorithm.pyod_cof = tods.detection_algorithm.PyodCOF:PyodCOF +tods.detection_algorithm.pyod_sod = tods.detection_algorithm.PyodSOD:SODPrimitive +tods.detection_algorithm.pyod_abod = tods.detection_algorithm.PyodABOD:ABODPrimitive +tods.detection_algorithm.pyod_hbos = tods.detection_algorithm.PyodHBOS:HBOSPrimitive +tods.detection_algorithm.pyod_iforest = tods.detection_algorithm.PyodIsolationForest:IsolationForest +tods.detection_algorithm.pyod_lof = tods.detection_algorithm.PyodLOF:LOFPrimitive +tods.detection_algorithm.pyod_autoencoder = tods.detection_algorithm.PyodAutoEncoder:AutoEncoderPrimitive +tods.detection_algorithm.pyod_knn = tods.detection_algorithm.PyodKNN:KNNPrimitive +tods.detection_algorithm.pyod_ocsvm = tods.detection_algorithm.PyodOCSVM:OCSVMPrimitive +tods.detection_algorithm.pyod_loda = tods.detection_algorithm.PyodLODA:LODAPrimitive +tods.detection_algorithm.pyod_cblof = tods.detection_algorithm.PyodCBLOF:CBLOFPrimitive 
+tods.detection_algorithm.pyod_sogaal = tods.detection_algorithm.PyodSoGaal:So_GaalPrimitive +tods.detection_algorithm.pyod_mogaal = tods.detection_algorithm.PyodMoGaal:Mo_GaalPrimitive -tods.detection_algorithm.matrix_profile = detection_algorithm.MatrixProfile:MatrixProfile -tods.detection_algorithm.AutoRegODetector = detection_algorithm.AutoRegODetect:AutoRegODetector +tods.detection_algorithm.matrix_profile = tods.detection_algorithm.MatrixProfile:MatrixProfile +tods.detection_algorithm.AutoRegODetector = tods.detection_algorithm.AutoRegODetect:AutoRegODetector -tods.detection_algorithm.LSTMODetector = detection_algorithm.LSTMODetect:LSTMODetector -tods.detection_algorithm.AutoRegODetector = detection_algorithm.AutoRegODetect:AutoRegODetector -tods.detection_algorithm.PCAODetector = detection_algorithm.PCAODetect:PCAODetector -tods.detection_algorithm.KDiscordODetector = detection_algorithm.KDiscordODetect:KDiscordODetector -tods.detection_algorithm.deeplog = detection_algorithm.DeepLog:DeepLogPrimitive -tods.detection_algorithm.telemanom = detection_algorithm.Telemanom:TelemanomPrimitive +tods.detection_algorithm.LSTMODetector = tods.detection_algorithm.LSTMODetect:LSTMODetector +tods.detection_algorithm.AutoRegODetector = tods.detection_algorithm.AutoRegODetect:AutoRegODetector +tods.detection_algorithm.PCAODetector = tods.detection_algorithm.PCAODetect:PCAODetector +tods.detection_algorithm.KDiscordODetector = tods.detection_algorithm.KDiscordODetect:KDiscordODetector +tods.detection_algorithm.deeplog = tods.detection_algorithm.DeepLog:DeepLogPrimitive +tods.detection_algorithm.telemanom = tods.detection_algorithm.Telemanom:TelemanomPrimitive -tods.reinforcement.rule_filter = reinforcement.RuleBasedFilter:RuleBasedFilter +tods.reinforcement.rule_filter = tods.reinforcement.RuleBasedFilter:RuleBasedFilter diff --git a/entry_points_common.ini b/entry_points_common.ini new file mode 100644 index 0000000..6676944 --- /dev/null +++ b/entry_points_common.ini @@ -0,0 +1,63 @@ +[d3m.primitives] +data_preprocessing.one_hot_encoder.MakerCommon = tods.common_primitives.one_hot_maker:OneHotMakerPrimitive +data_preprocessing.one_hot_encoder.PandasCommon = tods.common_primitives.pandas_onehot_encoder:PandasOneHotEncoderPrimitive +data_transformation.extract_columns.Common = tods.common_primitives.extract_columns:ExtractColumnsPrimitive +data_transformation.extract_columns_by_semantic_types.Common = tods.common_primitives.extract_columns_semantic_types:ExtractColumnsBySemanticTypesPrimitive +data_transformation.extract_columns_by_structural_types.Common = tods.common_primitives.extract_columns_structural_types:ExtractColumnsByStructuralTypesPrimitive +data_transformation.remove_columns.Common = tods.common_primitives.remove_columns:RemoveColumnsPrimitive +data_transformation.remove_duplicate_columns.Common = tods.common_primitives.remove_duplicate_columns:RemoveDuplicateColumnsPrimitive +data_transformation.horizontal_concat.DataFrameCommon = tods.common_primitives.horizontal_concat:HorizontalConcatPrimitive +data_transformation.cast_to_type.Common = tods.common_primitives.cast_to_type:CastToTypePrimitive +data_transformation.column_parser.Common = tods.common_primitives.column_parser:ColumnParserPrimitive +data_transformation.construct_predictions.Common = tods.common_primitives.construct_predictions:ConstructPredictionsPrimitive +data_transformation.dataframe_to_ndarray.Common = tods.common_primitives.dataframe_to_ndarray:DataFrameToNDArrayPrimitive 
+data_transformation.ndarray_to_dataframe.Common = tods.common_primitives.ndarray_to_dataframe:NDArrayToDataFramePrimitive +data_transformation.dataframe_to_list.Common = tods.common_primitives.dataframe_to_list:DataFrameToListPrimitive +data_transformation.list_to_dataframe.Common = tods.common_primitives.list_to_dataframe:ListToDataFramePrimitive +data_transformation.ndarray_to_list.Common = tods.common_primitives.ndarray_to_list:NDArrayToListPrimitive +data_transformation.list_to_ndarray.Common = tods.common_primitives.list_to_ndarray:ListToNDArrayPrimitive +data_transformation.stack_ndarray_column.Common = tods.common_primitives.stack_ndarray_column:StackNDArrayColumnPrimitive +data_transformation.add_semantic_types.Common = tods.common_primitives.add_semantic_types:AddSemanticTypesPrimitive +data_transformation.remove_semantic_types.Common = tods.common_primitives.remove_semantic_types:RemoveSemanticTypesPrimitive +data_transformation.replace_semantic_types.Common = tods.common_primitives.replace_semantic_types:ReplaceSemanticTypesPrimitive +data_transformation.denormalize.Common = tods.common_primitives.denormalize:DenormalizePrimitive +data_transformation.datetime_field_compose.Common = tods.common_primitives.datetime_field_compose:DatetimeFieldComposePrimitive +data_transformation.grouping_field_compose.Common = tods.common_primitives.grouping_field_compose:GroupingFieldComposePrimitive +data_transformation.dataset_to_dataframe.Common = tods.common_primitives.dataset_to_dataframe:DatasetToDataFramePrimitive +data_transformation.cut_audio.Common = tods.common_primitives.cut_audio:CutAudioPrimitive +data_transformation.rename_duplicate_name.DataFrameCommon = tods.common_primitives.rename_duplicate_columns:RenameDuplicateColumnsPrimitive +#data_transformation.normalize_column_references.Common = tods.common_primitives.normalize_column_references:NormalizeColumnReferencesPrimitive +#data_transformation.normalize_graphs.Common = tods.common_primitives.normalize_graphs:NormalizeGraphsPrimitive +data_transformation.ravel.DataFrameRowCommon = tods.common_primitives.ravel:RavelAsRowPrimitive +data_preprocessing.label_encoder.Common = tods.common_primitives.unseen_label_encoder:UnseenLabelEncoderPrimitive +data_preprocessing.label_decoder.Common = tods.common_primitives.unseen_label_decoder:UnseenLabelDecoderPrimitive +data_preprocessing.image_reader.Common = tods.common_primitives.dataframe_image_reader:DataFrameImageReaderPrimitive +data_preprocessing.text_reader.Common = tods.common_primitives.text_reader:TextReaderPrimitive +data_preprocessing.video_reader.Common = tods.common_primitives.video_reader:VideoReaderPrimitive +data_preprocessing.csv_reader.Common = tods.common_primitives.csv_reader:CSVReaderPrimitive +data_preprocessing.audio_reader.Common = tods.common_primitives.audio_reader:AudioReaderPrimitive +data_preprocessing.regex_filter.Common = tods.common_primitives.regex_filter:RegexFilterPrimitive +data_preprocessing.term_filter.Common = tods.common_primitives.term_filter:TermFilterPrimitive +data_preprocessing.numeric_range_filter.Common = tods.common_primitives.numeric_range_filter:NumericRangeFilterPrimitive +data_preprocessing.datetime_range_filter.Common = tods.common_primitives.datetime_range_filter:DatetimeRangeFilterPrimitive +data_preprocessing.dataset_sample.Common = tods.common_primitives.dataset_sample:DatasetSamplePrimitive +#data_preprocessing.time_interval_transform.Common = tods.common_primitives.time_interval_transform:TimeIntervalTransformPrimitive 
+data_cleaning.tabular_extractor.Common = tods.common_primitives.tabular_extractor:AnnotatedTabularExtractorPrimitive +evaluation.redact_columns.Common = tods.common_primitives.redact_columns:RedactColumnsPrimitive +evaluation.kfold_dataset_split.Common = tods.common_primitives.kfold_split:KFoldDatasetSplitPrimitive +evaluation.kfold_time_series_split.Common = tods.common_primitives.kfold_split_timeseries:KFoldTimeSeriesSplitPrimitive +evaluation.train_score_dataset_split.Common = tods.common_primitives.train_score_split:TrainScoreDatasetSplitPrimitive +evaluation.no_split_dataset_split.Common = tods.common_primitives.no_split:NoSplitDatasetSplitPrimitive +evaluation.fixed_split_dataset_split.Common = tods.common_primitives.fixed_split:FixedSplitDatasetSplitPrimitive +classification.random_forest.Common = tods.common_primitives.random_forest:RandomForestClassifierPrimitive +classification.light_gbm.Common = tods.common_primitives.lgbm_classifier:LightGBMClassifierPrimitive +classification.xgboost_gbtree.Common = tods.common_primitives.xgboost_gbtree:XGBoostGBTreeClassifierPrimitive +classification.xgboost_dart.Common = tods.common_primitives.xgboost_dart:XGBoostDartClassifierPrimitive +regression.xgboost_gbtree.Common = tods.common_primitives.xgboost_regressor:XGBoostGBTreeRegressorPrimitive +schema_discovery.profiler.Common = tods.common_primitives.simple_profiler:SimpleProfilerPrimitive +operator.column_map.Common = tods.common_primitives.column_map:DataFrameColumnMapPrimitive +operator.dataset_map.DataFrameCommon = tods.common_primitives.dataset_map:DataFrameDatasetMapPrimitive +data_preprocessing.flatten.DataFrameCommon = tods.common_primitives.dataframe_flatten:DataFrameFlattenPrimitive +metalearning.metafeature_extractor.Common = tods.common_primitives.compute_metafeatures:ComputeMetafeaturesPrimitive +data_augmentation.datamart_augmentation.Common = tods.common_primitives.datamart_augment:DataMartAugmentPrimitive +data_augmentation.datamart_download.Common = tods.common_primitives.datamart_download:DataMartDownloadPrimitive diff --git a/setup.py b/setup.py index 80506cd..04ce8e3 100644 --- a/setup.py +++ b/setup.py @@ -10,7 +10,7 @@ def read_file_entry_points(fname): return entry_points.read() def merge_entry_points(): - entry_list = ['entry_points.ini'] + entry_list = ['entry_points.ini', 'entry_points_common.ini'] merge_entry = [] for entry_name in entry_list: entry_point = read_file_entry_points(entry_name).replace(' ', '') @@ -29,6 +29,7 @@ setup( install_requires=[ 'd3m', 'Jinja2', + 'GitPython==3.1.0', 'simplejson==3.12.0', 'scikit-learn==0.22.0', 'statsmodels==0.11.1', @@ -38,7 +39,9 @@ setup( 'pyod', 'nimfa==1.4.0', 'stumpy==1.4.0', - 'more-itertools==8.5.0' + 'more-itertools==8.5.0', + 'gitdb2==2.0.6', + 'gitdb==0.6.4' ], entry_points = merge_entry_points() diff --git a/test.sh b/test.sh index af60b73..51587ff 100644 --- a/test.sh +++ b/test.sh @@ -1,7 +1,7 @@ #!/bin/bash -#test_scripts=$(ls tests) -test_scripts=$(ls tests | grep -v -f tested_file.txt) +test_scripts=$(ls tests) +#test_scripts=$(ls tests | grep -v -f tested_file.txt) for file in $test_scripts do diff --git a/common-primitives/common_primitives/__init__.py b/tods/common_primitives/__init__.py similarity index 100% rename from common-primitives/common_primitives/__init__.py rename to tods/common_primitives/__init__.py diff --git a/common-primitives/common_primitives/add_semantic_types.py b/tods/common_primitives/add_semantic_types.py similarity index 100% rename from
common-primitives/common_primitives/add_semantic_types.py rename to tods/common_primitives/add_semantic_types.py diff --git a/common-primitives/common_primitives/audio_reader.py b/tods/common_primitives/audio_reader.py similarity index 100% rename from common-primitives/common_primitives/audio_reader.py rename to tods/common_primitives/audio_reader.py diff --git a/common-primitives/common_primitives/base.py b/tods/common_primitives/base.py similarity index 100% rename from common-primitives/common_primitives/base.py rename to tods/common_primitives/base.py diff --git a/common-primitives/common_primitives/cast_to_type.py b/tods/common_primitives/cast_to_type.py similarity index 100% rename from common-primitives/common_primitives/cast_to_type.py rename to tods/common_primitives/cast_to_type.py diff --git a/common-primitives/common_primitives/column_map.py b/tods/common_primitives/column_map.py similarity index 100% rename from common-primitives/common_primitives/column_map.py rename to tods/common_primitives/column_map.py diff --git a/common-primitives/common_primitives/column_parser.py b/tods/common_primitives/column_parser.py similarity index 100% rename from common-primitives/common_primitives/column_parser.py rename to tods/common_primitives/column_parser.py diff --git a/common-primitives/common_primitives/compute_metafeatures.py b/tods/common_primitives/compute_metafeatures.py similarity index 100% rename from common-primitives/common_primitives/compute_metafeatures.py rename to tods/common_primitives/compute_metafeatures.py diff --git a/common-primitives/common_primitives/construct_predictions.py b/tods/common_primitives/construct_predictions.py similarity index 100% rename from common-primitives/common_primitives/construct_predictions.py rename to tods/common_primitives/construct_predictions.py diff --git a/common-primitives/common_primitives/csv_reader.py b/tods/common_primitives/csv_reader.py similarity index 100% rename from common-primitives/common_primitives/csv_reader.py rename to tods/common_primitives/csv_reader.py diff --git a/common-primitives/common_primitives/cut_audio.py b/tods/common_primitives/cut_audio.py similarity index 100% rename from common-primitives/common_primitives/cut_audio.py rename to tods/common_primitives/cut_audio.py diff --git a/common-primitives/common_primitives/dataframe_flatten.py b/tods/common_primitives/dataframe_flatten.py similarity index 100% rename from common-primitives/common_primitives/dataframe_flatten.py rename to tods/common_primitives/dataframe_flatten.py diff --git a/common-primitives/common_primitives/dataframe_image_reader.py b/tods/common_primitives/dataframe_image_reader.py similarity index 100% rename from common-primitives/common_primitives/dataframe_image_reader.py rename to tods/common_primitives/dataframe_image_reader.py diff --git a/common-primitives/common_primitives/dataframe_to_list.py b/tods/common_primitives/dataframe_to_list.py similarity index 100% rename from common-primitives/common_primitives/dataframe_to_list.py rename to tods/common_primitives/dataframe_to_list.py diff --git a/common-primitives/common_primitives/dataframe_to_ndarray.py b/tods/common_primitives/dataframe_to_ndarray.py similarity index 100% rename from common-primitives/common_primitives/dataframe_to_ndarray.py rename to tods/common_primitives/dataframe_to_ndarray.py diff --git a/common-primitives/common_primitives/dataframe_utils.py b/tods/common_primitives/dataframe_utils.py similarity index 100% rename from 
common-primitives/common_primitives/dataframe_utils.py rename to tods/common_primitives/dataframe_utils.py diff --git a/common-primitives/common_primitives/datamart_augment.py b/tods/common_primitives/datamart_augment.py similarity index 100% rename from common-primitives/common_primitives/datamart_augment.py rename to tods/common_primitives/datamart_augment.py diff --git a/common-primitives/common_primitives/datamart_download.py b/tods/common_primitives/datamart_download.py similarity index 100% rename from common-primitives/common_primitives/datamart_download.py rename to tods/common_primitives/datamart_download.py diff --git a/common-primitives/common_primitives/dataset_map.py b/tods/common_primitives/dataset_map.py similarity index 100% rename from common-primitives/common_primitives/dataset_map.py rename to tods/common_primitives/dataset_map.py diff --git a/common-primitives/common_primitives/dataset_sample.py b/tods/common_primitives/dataset_sample.py similarity index 100% rename from common-primitives/common_primitives/dataset_sample.py rename to tods/common_primitives/dataset_sample.py diff --git a/common-primitives/common_primitives/dataset_to_dataframe.py b/tods/common_primitives/dataset_to_dataframe.py similarity index 99% rename from common-primitives/common_primitives/dataset_to_dataframe.py rename to tods/common_primitives/dataset_to_dataframe.py index 4f8abe3..a499e8c 100644 --- a/common-primitives/common_primitives/dataset_to_dataframe.py +++ b/tods/common_primitives/dataset_to_dataframe.py @@ -6,7 +6,7 @@ from d3m.base import utils as base_utils from d3m.metadata import base as metadata_base, hyperparams from d3m.primitive_interfaces import base, transformer import logging -import common_primitives +import common_primitives __all__ = ('DatasetToDataFramePrimitive',) diff --git a/common-primitives/common_primitives/dataset_utils.py b/tods/common_primitives/dataset_utils.py similarity index 100% rename from common-primitives/common_primitives/dataset_utils.py rename to tods/common_primitives/dataset_utils.py diff --git a/common-primitives/common_primitives/datetime_field_compose.py b/tods/common_primitives/datetime_field_compose.py similarity index 100% rename from common-primitives/common_primitives/datetime_field_compose.py rename to tods/common_primitives/datetime_field_compose.py diff --git a/common-primitives/common_primitives/datetime_range_filter.py b/tods/common_primitives/datetime_range_filter.py similarity index 100% rename from common-primitives/common_primitives/datetime_range_filter.py rename to tods/common_primitives/datetime_range_filter.py diff --git a/common-primitives/common_primitives/denormalize.py b/tods/common_primitives/denormalize.py similarity index 100% rename from common-primitives/common_primitives/denormalize.py rename to tods/common_primitives/denormalize.py diff --git a/common-primitives/common_primitives/extract_columns.py b/tods/common_primitives/extract_columns.py similarity index 100% rename from common-primitives/common_primitives/extract_columns.py rename to tods/common_primitives/extract_columns.py diff --git a/common-primitives/common_primitives/extract_columns_semantic_types.py b/tods/common_primitives/extract_columns_semantic_types.py similarity index 100% rename from common-primitives/common_primitives/extract_columns_semantic_types.py rename to tods/common_primitives/extract_columns_semantic_types.py diff --git a/common-primitives/common_primitives/extract_columns_structural_types.py 
b/tods/common_primitives/extract_columns_structural_types.py similarity index 100% rename from common-primitives/common_primitives/extract_columns_structural_types.py rename to tods/common_primitives/extract_columns_structural_types.py diff --git a/common-primitives/common_primitives/fixed_split.py b/tods/common_primitives/fixed_split.py similarity index 100% rename from common-primitives/common_primitives/fixed_split.py rename to tods/common_primitives/fixed_split.py diff --git a/common-primitives/common_primitives/grouping_field_compose.py b/tods/common_primitives/grouping_field_compose.py similarity index 100% rename from common-primitives/common_primitives/grouping_field_compose.py rename to tods/common_primitives/grouping_field_compose.py diff --git a/common-primitives/common_primitives/holt_smoothing.py b/tods/common_primitives/holt_smoothing.py similarity index 100% rename from common-primitives/common_primitives/holt_smoothing.py rename to tods/common_primitives/holt_smoothing.py diff --git a/common-primitives/common_primitives/holt_winters_exponential_smoothing.py b/tods/common_primitives/holt_winters_exponential_smoothing.py similarity index 100% rename from common-primitives/common_primitives/holt_winters_exponential_smoothing.py rename to tods/common_primitives/holt_winters_exponential_smoothing.py diff --git a/common-primitives/common_primitives/horizontal_concat.py b/tods/common_primitives/horizontal_concat.py similarity index 100% rename from common-primitives/common_primitives/horizontal_concat.py rename to tods/common_primitives/horizontal_concat.py diff --git a/common-primitives/common_primitives/kfold_split.py b/tods/common_primitives/kfold_split.py similarity index 100% rename from common-primitives/common_primitives/kfold_split.py rename to tods/common_primitives/kfold_split.py diff --git a/common-primitives/common_primitives/kfold_split_timeseries.py b/tods/common_primitives/kfold_split_timeseries.py similarity index 100% rename from common-primitives/common_primitives/kfold_split_timeseries.py rename to tods/common_primitives/kfold_split_timeseries.py diff --git a/common-primitives/common_primitives/lgbm_classifier.py b/tods/common_primitives/lgbm_classifier.py similarity index 100% rename from common-primitives/common_primitives/lgbm_classifier.py rename to tods/common_primitives/lgbm_classifier.py diff --git a/common-primitives/common_primitives/list_to_dataframe.py b/tods/common_primitives/list_to_dataframe.py similarity index 100% rename from common-primitives/common_primitives/list_to_dataframe.py rename to tods/common_primitives/list_to_dataframe.py diff --git a/common-primitives/common_primitives/list_to_ndarray.py b/tods/common_primitives/list_to_ndarray.py similarity index 100% rename from common-primitives/common_primitives/list_to_ndarray.py rename to tods/common_primitives/list_to_ndarray.py diff --git a/common-primitives/common_primitives/mean_average_transform.py b/tods/common_primitives/mean_average_transform.py similarity index 100% rename from common-primitives/common_primitives/mean_average_transform.py rename to tods/common_primitives/mean_average_transform.py diff --git a/common-primitives/common_primitives/ndarray_to_dataframe.py b/tods/common_primitives/ndarray_to_dataframe.py similarity index 100% rename from common-primitives/common_primitives/ndarray_to_dataframe.py rename to tods/common_primitives/ndarray_to_dataframe.py diff --git a/common-primitives/common_primitives/ndarray_to_list.py b/tods/common_primitives/ndarray_to_list.py similarity 
index 100% rename from common-primitives/common_primitives/ndarray_to_list.py rename to tods/common_primitives/ndarray_to_list.py diff --git a/common-primitives/common_primitives/no_split.py b/tods/common_primitives/no_split.py similarity index 100% rename from common-primitives/common_primitives/no_split.py rename to tods/common_primitives/no_split.py diff --git a/common-primitives/common_primitives/normalize_column_references.py b/tods/common_primitives/normalize_column_references.py similarity index 100% rename from common-primitives/common_primitives/normalize_column_references.py rename to tods/common_primitives/normalize_column_references.py diff --git a/common-primitives/common_primitives/normalize_graphs.py b/tods/common_primitives/normalize_graphs.py similarity index 100% rename from common-primitives/common_primitives/normalize_graphs.py rename to tods/common_primitives/normalize_graphs.py diff --git a/common-primitives/common_primitives/numeric_range_filter.py b/tods/common_primitives/numeric_range_filter.py similarity index 100% rename from common-primitives/common_primitives/numeric_range_filter.py rename to tods/common_primitives/numeric_range_filter.py diff --git a/common-primitives/common_primitives/one_hot_maker.py b/tods/common_primitives/one_hot_maker.py similarity index 100% rename from common-primitives/common_primitives/one_hot_maker.py rename to tods/common_primitives/one_hot_maker.py diff --git a/common-primitives/common_primitives/pandas_onehot_encoder.py b/tods/common_primitives/pandas_onehot_encoder.py similarity index 100% rename from common-primitives/common_primitives/pandas_onehot_encoder.py rename to tods/common_primitives/pandas_onehot_encoder.py diff --git a/common-primitives/common_primitives/random_forest.py b/tods/common_primitives/random_forest.py similarity index 100% rename from common-primitives/common_primitives/random_forest.py rename to tods/common_primitives/random_forest.py diff --git a/common-primitives/common_primitives/ravel.py b/tods/common_primitives/ravel.py similarity index 100% rename from common-primitives/common_primitives/ravel.py rename to tods/common_primitives/ravel.py diff --git a/common-primitives/common_primitives/redact_columns.py b/tods/common_primitives/redact_columns.py similarity index 100% rename from common-primitives/common_primitives/redact_columns.py rename to tods/common_primitives/redact_columns.py diff --git a/common-primitives/common_primitives/regex_filter.py b/tods/common_primitives/regex_filter.py similarity index 100% rename from common-primitives/common_primitives/regex_filter.py rename to tods/common_primitives/regex_filter.py diff --git a/common-primitives/common_primitives/remove_columns.py b/tods/common_primitives/remove_columns.py similarity index 100% rename from common-primitives/common_primitives/remove_columns.py rename to tods/common_primitives/remove_columns.py diff --git a/common-primitives/common_primitives/remove_duplicate_columns.py b/tods/common_primitives/remove_duplicate_columns.py similarity index 100% rename from common-primitives/common_primitives/remove_duplicate_columns.py rename to tods/common_primitives/remove_duplicate_columns.py diff --git a/common-primitives/common_primitives/remove_semantic_types.py b/tods/common_primitives/remove_semantic_types.py similarity index 100% rename from common-primitives/common_primitives/remove_semantic_types.py rename to tods/common_primitives/remove_semantic_types.py diff --git a/common-primitives/common_primitives/rename_duplicate_columns.py 
b/tods/common_primitives/rename_duplicate_columns.py similarity index 100% rename from common-primitives/common_primitives/rename_duplicate_columns.py rename to tods/common_primitives/rename_duplicate_columns.py diff --git a/common-primitives/common_primitives/replace_semantic_types.py b/tods/common_primitives/replace_semantic_types.py similarity index 100% rename from common-primitives/common_primitives/replace_semantic_types.py rename to tods/common_primitives/replace_semantic_types.py diff --git a/common-primitives/common_primitives/simple_exponential_smoothing.py b/tods/common_primitives/simple_exponential_smoothing.py similarity index 100% rename from common-primitives/common_primitives/simple_exponential_smoothing.py rename to tods/common_primitives/simple_exponential_smoothing.py diff --git a/common-primitives/common_primitives/simple_profiler.py b/tods/common_primitives/simple_profiler.py similarity index 100% rename from common-primitives/common_primitives/simple_profiler.py rename to tods/common_primitives/simple_profiler.py diff --git a/common-primitives/common_primitives/slacker/README.md b/tods/common_primitives/slacker/README.md similarity index 100% rename from common-primitives/common_primitives/slacker/README.md rename to tods/common_primitives/slacker/README.md diff --git a/common-primitives/common_primitives/slacker/__init__.py b/tods/common_primitives/slacker/__init__.py similarity index 100% rename from common-primitives/common_primitives/slacker/__init__.py rename to tods/common_primitives/slacker/__init__.py diff --git a/common-primitives/common_primitives/slacker/base.py b/tods/common_primitives/slacker/base.py similarity index 100% rename from common-primitives/common_primitives/slacker/base.py rename to tods/common_primitives/slacker/base.py diff --git a/common-primitives/common_primitives/slacker/estimation.py b/tods/common_primitives/slacker/estimation.py similarity index 100% rename from common-primitives/common_primitives/slacker/estimation.py rename to tods/common_primitives/slacker/estimation.py diff --git a/common-primitives/common_primitives/slacker/feature_extraction.py b/tods/common_primitives/slacker/feature_extraction.py similarity index 100% rename from common-primitives/common_primitives/slacker/feature_extraction.py rename to tods/common_primitives/slacker/feature_extraction.py diff --git a/common-primitives/common_primitives/slacker/feature_selection.py b/tods/common_primitives/slacker/feature_selection.py similarity index 100% rename from common-primitives/common_primitives/slacker/feature_selection.py rename to tods/common_primitives/slacker/feature_selection.py diff --git a/common-primitives/common_primitives/stack_ndarray_column.py b/tods/common_primitives/stack_ndarray_column.py similarity index 100% rename from common-primitives/common_primitives/stack_ndarray_column.py rename to tods/common_primitives/stack_ndarray_column.py diff --git a/common-primitives/common_primitives/tabular_extractor.py b/tods/common_primitives/tabular_extractor.py similarity index 100% rename from common-primitives/common_primitives/tabular_extractor.py rename to tods/common_primitives/tabular_extractor.py diff --git a/common-primitives/common_primitives/term_filter.py b/tods/common_primitives/term_filter.py similarity index 100% rename from common-primitives/common_primitives/term_filter.py rename to tods/common_primitives/term_filter.py diff --git a/common-primitives/common_primitives/text_reader.py b/tods/common_primitives/text_reader.py similarity index 100% 
rename from common-primitives/common_primitives/text_reader.py rename to tods/common_primitives/text_reader.py diff --git a/common-primitives/common_primitives/train_score_split.py b/tods/common_primitives/train_score_split.py similarity index 100% rename from common-primitives/common_primitives/train_score_split.py rename to tods/common_primitives/train_score_split.py diff --git a/common-primitives/common_primitives/unseen_label_decoder.py b/tods/common_primitives/unseen_label_decoder.py similarity index 100% rename from common-primitives/common_primitives/unseen_label_decoder.py rename to tods/common_primitives/unseen_label_decoder.py diff --git a/common-primitives/common_primitives/unseen_label_encoder.py b/tods/common_primitives/unseen_label_encoder.py similarity index 100% rename from common-primitives/common_primitives/unseen_label_encoder.py rename to tods/common_primitives/unseen_label_encoder.py diff --git a/common-primitives/common_primitives/utils.py b/tods/common_primitives/utils.py similarity index 100% rename from common-primitives/common_primitives/utils.py rename to tods/common_primitives/utils.py diff --git a/common-primitives/common_primitives/video_reader.py b/tods/common_primitives/video_reader.py similarity index 100% rename from common-primitives/common_primitives/video_reader.py rename to tods/common_primitives/video_reader.py diff --git a/common-primitives/common_primitives/xgboost_dart.py b/tods/common_primitives/xgboost_dart.py similarity index 100% rename from common-primitives/common_primitives/xgboost_dart.py rename to tods/common_primitives/xgboost_dart.py diff --git a/common-primitives/common_primitives/xgboost_gbtree.py b/tods/common_primitives/xgboost_gbtree.py similarity index 100% rename from common-primitives/common_primitives/xgboost_gbtree.py rename to tods/common_primitives/xgboost_gbtree.py diff --git a/common-primitives/common_primitives/xgboost_regressor.py b/tods/common_primitives/xgboost_regressor.py similarity index 100% rename from common-primitives/common_primitives/xgboost_regressor.py rename to tods/common_primitives/xgboost_regressor.py diff --git a/data_processing/CategoricalToBinary.py b/tods/data_processing/CategoricalToBinary.py similarity index 100% rename from data_processing/CategoricalToBinary.py rename to tods/data_processing/CategoricalToBinary.py diff --git a/data_processing/ColumnFilter.py b/tods/data_processing/ColumnFilter.py similarity index 100% rename from data_processing/ColumnFilter.py rename to tods/data_processing/ColumnFilter.py diff --git a/data_processing/ContinuityValidation.py b/tods/data_processing/ContinuityValidation.py similarity index 100% rename from data_processing/ContinuityValidation.py rename to tods/data_processing/ContinuityValidation.py diff --git a/data_processing/DatasetToDataframe.py b/tods/data_processing/DatasetToDataframe.py similarity index 100% rename from data_processing/DatasetToDataframe.py rename to tods/data_processing/DatasetToDataframe.py diff --git a/data_processing/DuplicationValidation.py b/tods/data_processing/DuplicationValidation.py similarity index 100% rename from data_processing/DuplicationValidation.py rename to tods/data_processing/DuplicationValidation.py diff --git a/data_processing/TimeIntervalTransform.py b/tods/data_processing/TimeIntervalTransform.py similarity index 100% rename from data_processing/TimeIntervalTransform.py rename to tods/data_processing/TimeIntervalTransform.py diff --git a/data_processing/TimeStampValidation.py 
b/tods/data_processing/TimeStampValidation.py similarity index 100% rename from data_processing/TimeStampValidation.py rename to tods/data_processing/TimeStampValidation.py diff --git a/data_processing/__init__.py b/tods/data_processing/__init__.py similarity index 100% rename from data_processing/__init__.py rename to tods/data_processing/__init__.py diff --git a/detection_algorithm/AutoRegODetect.py b/tods/detection_algorithm/AutoRegODetect.py similarity index 100% rename from detection_algorithm/AutoRegODetect.py rename to tods/detection_algorithm/AutoRegODetect.py diff --git a/detection_algorithm/DeepLog.py b/tods/detection_algorithm/DeepLog.py similarity index 100% rename from detection_algorithm/DeepLog.py rename to tods/detection_algorithm/DeepLog.py diff --git a/detection_algorithm/KDiscordODetect.py b/tods/detection_algorithm/KDiscordODetect.py similarity index 100% rename from detection_algorithm/KDiscordODetect.py rename to tods/detection_algorithm/KDiscordODetect.py diff --git a/detection_algorithm/LSTMODetect.py b/tods/detection_algorithm/LSTMODetect.py similarity index 100% rename from detection_algorithm/LSTMODetect.py rename to tods/detection_algorithm/LSTMODetect.py diff --git a/detection_algorithm/MatrixProfile.py b/tods/detection_algorithm/MatrixProfile.py similarity index 100% rename from detection_algorithm/MatrixProfile.py rename to tods/detection_algorithm/MatrixProfile.py diff --git a/detection_algorithm/PCAODetect.py b/tods/detection_algorithm/PCAODetect.py similarity index 100% rename from detection_algorithm/PCAODetect.py rename to tods/detection_algorithm/PCAODetect.py diff --git a/detection_algorithm/PyodABOD.py b/tods/detection_algorithm/PyodABOD.py similarity index 100% rename from detection_algorithm/PyodABOD.py rename to tods/detection_algorithm/PyodABOD.py diff --git a/detection_algorithm/PyodAE.py b/tods/detection_algorithm/PyodAE.py similarity index 100% rename from detection_algorithm/PyodAE.py rename to tods/detection_algorithm/PyodAE.py diff --git a/detection_algorithm/PyodCBLOF.py b/tods/detection_algorithm/PyodCBLOF.py similarity index 100% rename from detection_algorithm/PyodCBLOF.py rename to tods/detection_algorithm/PyodCBLOF.py diff --git a/detection_algorithm/PyodCOF.py b/tods/detection_algorithm/PyodCOF.py similarity index 100% rename from detection_algorithm/PyodCOF.py rename to tods/detection_algorithm/PyodCOF.py diff --git a/detection_algorithm/PyodHBOS.py b/tods/detection_algorithm/PyodHBOS.py similarity index 100% rename from detection_algorithm/PyodHBOS.py rename to tods/detection_algorithm/PyodHBOS.py diff --git a/detection_algorithm/PyodIsolationForest.py b/tods/detection_algorithm/PyodIsolationForest.py similarity index 100% rename from detection_algorithm/PyodIsolationForest.py rename to tods/detection_algorithm/PyodIsolationForest.py diff --git a/detection_algorithm/PyodKNN.py b/tods/detection_algorithm/PyodKNN.py similarity index 100% rename from detection_algorithm/PyodKNN.py rename to tods/detection_algorithm/PyodKNN.py diff --git a/detection_algorithm/PyodLODA.py b/tods/detection_algorithm/PyodLODA.py similarity index 100% rename from detection_algorithm/PyodLODA.py rename to tods/detection_algorithm/PyodLODA.py diff --git a/detection_algorithm/PyodLOF.py b/tods/detection_algorithm/PyodLOF.py similarity index 100% rename from detection_algorithm/PyodLOF.py rename to tods/detection_algorithm/PyodLOF.py diff --git a/detection_algorithm/PyodMoGaal.py b/tods/detection_algorithm/PyodMoGaal.py similarity index 100% rename from 
detection_algorithm/PyodMoGaal.py rename to tods/detection_algorithm/PyodMoGaal.py diff --git a/detection_algorithm/PyodOCSVM.py b/tods/detection_algorithm/PyodOCSVM.py similarity index 100% rename from detection_algorithm/PyodOCSVM.py rename to tods/detection_algorithm/PyodOCSVM.py diff --git a/detection_algorithm/PyodSOD.py b/tods/detection_algorithm/PyodSOD.py similarity index 100% rename from detection_algorithm/PyodSOD.py rename to tods/detection_algorithm/PyodSOD.py diff --git a/detection_algorithm/PyodSoGaal.py b/tods/detection_algorithm/PyodSoGaal.py similarity index 100% rename from detection_algorithm/PyodSoGaal.py rename to tods/detection_algorithm/PyodSoGaal.py diff --git a/detection_algorithm/PyodVAE.py b/tods/detection_algorithm/PyodVAE.py similarity index 100% rename from detection_algorithm/PyodVAE.py rename to tods/detection_algorithm/PyodVAE.py diff --git a/detection_algorithm/Telemanom.py b/tods/detection_algorithm/Telemanom.py similarity index 100% rename from detection_algorithm/Telemanom.py rename to tods/detection_algorithm/Telemanom.py diff --git a/detection_algorithm/UODBasePrimitive.py b/tods/detection_algorithm/UODBasePrimitive.py similarity index 100% rename from detection_algorithm/UODBasePrimitive.py rename to tods/detection_algorithm/UODBasePrimitive.py diff --git a/detection_algorithm/core/AutoRegOD.py b/tods/detection_algorithm/core/AutoRegOD.py similarity index 100% rename from detection_algorithm/core/AutoRegOD.py rename to tods/detection_algorithm/core/AutoRegOD.py diff --git a/detection_algorithm/core/CollectiveBase.py b/tods/detection_algorithm/core/CollectiveBase.py similarity index 100% rename from detection_algorithm/core/CollectiveBase.py rename to tods/detection_algorithm/core/CollectiveBase.py diff --git a/detection_algorithm/core/CollectiveCommonTest.py b/tods/detection_algorithm/core/CollectiveCommonTest.py similarity index 100% rename from detection_algorithm/core/CollectiveCommonTest.py rename to tods/detection_algorithm/core/CollectiveCommonTest.py diff --git a/detection_algorithm/core/KDiscord.py b/tods/detection_algorithm/core/KDiscord.py similarity index 100% rename from detection_algorithm/core/KDiscord.py rename to tods/detection_algorithm/core/KDiscord.py diff --git a/detection_algorithm/core/LSTMOD.py b/tods/detection_algorithm/core/LSTMOD.py similarity index 100% rename from detection_algorithm/core/LSTMOD.py rename to tods/detection_algorithm/core/LSTMOD.py diff --git a/detection_algorithm/core/MultiAutoRegOD.py b/tods/detection_algorithm/core/MultiAutoRegOD.py similarity index 100% rename from detection_algorithm/core/MultiAutoRegOD.py rename to tods/detection_algorithm/core/MultiAutoRegOD.py diff --git a/detection_algorithm/core/PCA.py b/tods/detection_algorithm/core/PCA.py similarity index 100% rename from detection_algorithm/core/PCA.py rename to tods/detection_algorithm/core/PCA.py diff --git a/detection_algorithm/core/UODCommonTest.py b/tods/detection_algorithm/core/UODCommonTest.py similarity index 100% rename from detection_algorithm/core/UODCommonTest.py rename to tods/detection_algorithm/core/UODCommonTest.py diff --git a/detection_algorithm/core/algorithm_implementation.py b/tods/detection_algorithm/core/algorithm_implementation.py similarity index 100% rename from detection_algorithm/core/algorithm_implementation.py rename to tods/detection_algorithm/core/algorithm_implementation.py diff --git a/detection_algorithm/core/test_CollectiveBase.py b/tods/detection_algorithm/core/test_CollectiveBase.py similarity index 100% 
rename from detection_algorithm/core/test_CollectiveBase.py
rename to tods/detection_algorithm/core/test_CollectiveBase.py
diff --git a/detection_algorithm/core/utility.py b/tods/detection_algorithm/core/utility.py
similarity index 100%
rename from detection_algorithm/core/utility.py
rename to tods/detection_algorithm/core/utility.py
diff --git a/detection_algorithm/core/utils/channel.py b/tods/detection_algorithm/core/utils/channel.py
similarity index 100%
rename from detection_algorithm/core/utils/channel.py
rename to tods/detection_algorithm/core/utils/channel.py
diff --git a/detection_algorithm/core/utils/errors.py b/tods/detection_algorithm/core/utils/errors.py
similarity index 100%
rename from detection_algorithm/core/utils/errors.py
rename to tods/detection_algorithm/core/utils/errors.py
diff --git a/detection_algorithm/core/utils/modeling.py b/tods/detection_algorithm/core/utils/modeling.py
similarity index 100%
rename from detection_algorithm/core/utils/modeling.py
rename to tods/detection_algorithm/core/utils/modeling.py
diff --git a/detection_algorithm/core/utils/utils.py b/tods/detection_algorithm/core/utils/utils.py
similarity index 100%
rename from detection_algorithm/core/utils/utils.py
rename to tods/detection_algorithm/core/utils/utils.py
diff --git a/feature_analysis/AutoCorrelation.py b/tods/feature_analysis/AutoCorrelation.py
similarity index 100%
rename from feature_analysis/AutoCorrelation.py
rename to tods/feature_analysis/AutoCorrelation.py
diff --git a/feature_analysis/BKFilter.py b/tods/feature_analysis/BKFilter.py
similarity index 100%
rename from feature_analysis/BKFilter.py
rename to tods/feature_analysis/BKFilter.py
diff --git a/feature_analysis/DiscreteCosineTransform.py b/tods/feature_analysis/DiscreteCosineTransform.py
similarity index 100%
rename from feature_analysis/DiscreteCosineTransform.py
rename to tods/feature_analysis/DiscreteCosineTransform.py
diff --git a/feature_analysis/FastFourierTransform.py b/tods/feature_analysis/FastFourierTransform.py
similarity index 100%
rename from feature_analysis/FastFourierTransform.py
rename to tods/feature_analysis/FastFourierTransform.py
diff --git a/feature_analysis/HPFilter.py b/tods/feature_analysis/HPFilter.py
similarity index 100%
rename from feature_analysis/HPFilter.py
rename to tods/feature_analysis/HPFilter.py
diff --git a/feature_analysis/NonNegativeMatrixFactorization.py b/tods/feature_analysis/NonNegativeMatrixFactorization.py
similarity index 100%
rename from feature_analysis/NonNegativeMatrixFactorization.py
rename to tods/feature_analysis/NonNegativeMatrixFactorization.py
diff --git a/feature_analysis/SKTruncatedSVD.py b/tods/feature_analysis/SKTruncatedSVD.py
similarity index 100%
rename from feature_analysis/SKTruncatedSVD.py
rename to tods/feature_analysis/SKTruncatedSVD.py
diff --git a/feature_analysis/SpectralResidualTransform.py b/tods/feature_analysis/SpectralResidualTransform.py
similarity index 100%
rename from feature_analysis/SpectralResidualTransform.py
rename to tods/feature_analysis/SpectralResidualTransform.py
diff --git a/feature_analysis/StatisticalAbsEnergy.py b/tods/feature_analysis/StatisticalAbsEnergy.py
similarity index 100%
rename from feature_analysis/StatisticalAbsEnergy.py
rename to tods/feature_analysis/StatisticalAbsEnergy.py
diff --git a/feature_analysis/StatisticalAbsSum.py b/tods/feature_analysis/StatisticalAbsSum.py
similarity index 100%
rename from feature_analysis/StatisticalAbsSum.py
rename to tods/feature_analysis/StatisticalAbsSum.py
diff --git a/feature_analysis/StatisticalGmean.py b/tods/feature_analysis/StatisticalGmean.py
similarity index 100%
rename from feature_analysis/StatisticalGmean.py
rename to tods/feature_analysis/StatisticalGmean.py
diff --git a/feature_analysis/StatisticalHmean.py b/tods/feature_analysis/StatisticalHmean.py
similarity index 100%
rename from feature_analysis/StatisticalHmean.py
rename to tods/feature_analysis/StatisticalHmean.py
diff --git a/feature_analysis/StatisticalKurtosis.py b/tods/feature_analysis/StatisticalKurtosis.py
similarity index 100%
rename from feature_analysis/StatisticalKurtosis.py
rename to tods/feature_analysis/StatisticalKurtosis.py
diff --git a/feature_analysis/StatisticalMaximum.py b/tods/feature_analysis/StatisticalMaximum.py
similarity index 100%
rename from feature_analysis/StatisticalMaximum.py
rename to tods/feature_analysis/StatisticalMaximum.py
diff --git a/feature_analysis/StatisticalMean.py b/tods/feature_analysis/StatisticalMean.py
similarity index 100%
rename from feature_analysis/StatisticalMean.py
rename to tods/feature_analysis/StatisticalMean.py
diff --git a/feature_analysis/StatisticalMeanAbs.py b/tods/feature_analysis/StatisticalMeanAbs.py
similarity index 100%
rename from feature_analysis/StatisticalMeanAbs.py
rename to tods/feature_analysis/StatisticalMeanAbs.py
diff --git a/feature_analysis/StatisticalMeanAbsTemporalDerivative.py b/tods/feature_analysis/StatisticalMeanAbsTemporalDerivative.py
similarity index 100%
rename from feature_analysis/StatisticalMeanAbsTemporalDerivative.py
rename to tods/feature_analysis/StatisticalMeanAbsTemporalDerivative.py
diff --git a/feature_analysis/StatisticalMeanTemporalDerivative.py b/tods/feature_analysis/StatisticalMeanTemporalDerivative.py
similarity index 100%
rename from feature_analysis/StatisticalMeanTemporalDerivative.py
rename to tods/feature_analysis/StatisticalMeanTemporalDerivative.py
diff --git a/feature_analysis/StatisticalMedian.py b/tods/feature_analysis/StatisticalMedian.py
similarity index 100%
rename from feature_analysis/StatisticalMedian.py
rename to tods/feature_analysis/StatisticalMedian.py
diff --git a/feature_analysis/StatisticalMedianAbsoluteDeviation.py b/tods/feature_analysis/StatisticalMedianAbsoluteDeviation.py
similarity index 100%
rename from feature_analysis/StatisticalMedianAbsoluteDeviation.py
rename to tods/feature_analysis/StatisticalMedianAbsoluteDeviation.py
diff --git a/feature_analysis/StatisticalMinimum.py b/tods/feature_analysis/StatisticalMinimum.py
similarity index 100%
rename from feature_analysis/StatisticalMinimum.py
rename to tods/feature_analysis/StatisticalMinimum.py
diff --git a/feature_analysis/StatisticalSkew.py b/tods/feature_analysis/StatisticalSkew.py
similarity index 100%
rename from feature_analysis/StatisticalSkew.py
rename to tods/feature_analysis/StatisticalSkew.py
diff --git a/feature_analysis/StatisticalStd.py b/tods/feature_analysis/StatisticalStd.py
similarity index 100%
rename from feature_analysis/StatisticalStd.py
rename to tods/feature_analysis/StatisticalStd.py
diff --git a/feature_analysis/StatisticalVar.py b/tods/feature_analysis/StatisticalVar.py
similarity index 100%
rename from feature_analysis/StatisticalVar.py
rename to tods/feature_analysis/StatisticalVar.py
diff --git a/feature_analysis/StatisticalVariation.py b/tods/feature_analysis/StatisticalVariation.py
similarity index 100%
rename from feature_analysis/StatisticalVariation.py
rename to tods/feature_analysis/StatisticalVariation.py
diff --git a/feature_analysis/StatisticalVecSum.py b/tods/feature_analysis/StatisticalVecSum.py
similarity index 100%
rename from feature_analysis/StatisticalVecSum.py
rename to tods/feature_analysis/StatisticalVecSum.py
diff --git a/feature_analysis/StatisticalWillisonAmplitude.py b/tods/feature_analysis/StatisticalWillisonAmplitude.py
similarity index 100%
rename from feature_analysis/StatisticalWillisonAmplitude.py
rename to tods/feature_analysis/StatisticalWillisonAmplitude.py
diff --git a/feature_analysis/StatisticalZeroCrossing.py b/tods/feature_analysis/StatisticalZeroCrossing.py
similarity index 100%
rename from feature_analysis/StatisticalZeroCrossing.py
rename to tods/feature_analysis/StatisticalZeroCrossing.py
diff --git a/feature_analysis/TRMF.py b/tods/feature_analysis/TRMF.py
similarity index 100%
rename from feature_analysis/TRMF.py
rename to tods/feature_analysis/TRMF.py
diff --git a/feature_analysis/WaveletTransform.py b/tods/feature_analysis/WaveletTransform.py
similarity index 100%
rename from feature_analysis/WaveletTransform.py
rename to tods/feature_analysis/WaveletTransform.py
diff --git a/feature_analysis/__init__.py b/tods/feature_analysis/__init__.py
similarity index 100%
rename from feature_analysis/__init__.py
rename to tods/feature_analysis/__init__.py
diff --git a/reinforcement/RuleBasedFilter.py b/tods/reinforcement/RuleBasedFilter.py
similarity index 100%
rename from reinforcement/RuleBasedFilter.py
rename to tods/reinforcement/RuleBasedFilter.py
diff --git a/timeseries_processing/.HoltSmoothing.py.swo b/tods/timeseries_processing/.HoltSmoothing.py.swo
similarity index 100%
rename from timeseries_processing/.HoltSmoothing.py.swo
rename to tods/timeseries_processing/.HoltSmoothing.py.swo
diff --git a/timeseries_processing/HoltSmoothing.py b/tods/timeseries_processing/HoltSmoothing.py
similarity index 100%
rename from timeseries_processing/HoltSmoothing.py
rename to tods/timeseries_processing/HoltSmoothing.py
diff --git a/timeseries_processing/HoltWintersExponentialSmoothing.py b/tods/timeseries_processing/HoltWintersExponentialSmoothing.py
similarity index 100%
rename from timeseries_processing/HoltWintersExponentialSmoothing.py
rename to tods/timeseries_processing/HoltWintersExponentialSmoothing.py
diff --git a/timeseries_processing/MovingAverageTransform.py b/tods/timeseries_processing/MovingAverageTransform.py
similarity index 100%
rename from timeseries_processing/MovingAverageTransform.py
rename to tods/timeseries_processing/MovingAverageTransform.py
diff --git a/timeseries_processing/SKAxiswiseScaler.py b/tods/timeseries_processing/SKAxiswiseScaler.py
similarity index 100%
rename from timeseries_processing/SKAxiswiseScaler.py
rename to tods/timeseries_processing/SKAxiswiseScaler.py
diff --git a/timeseries_processing/SKPowerTransformer.py b/tods/timeseries_processing/SKPowerTransformer.py
similarity index 100%
rename from timeseries_processing/SKPowerTransformer.py
rename to tods/timeseries_processing/SKPowerTransformer.py
diff --git a/timeseries_processing/SKQuantileTransformer.py b/tods/timeseries_processing/SKQuantileTransformer.py
similarity index 100%
rename from timeseries_processing/SKQuantileTransformer.py
rename to tods/timeseries_processing/SKQuantileTransformer.py
diff --git a/timeseries_processing/SKStandardScaler.py b/tods/timeseries_processing/SKStandardScaler.py
similarity index 100%
rename from timeseries_processing/SKStandardScaler.py
rename to tods/timeseries_processing/SKStandardScaler.py
diff --git a/timeseries_processing/SimpleExponentialSmoothing.py b/tods/timeseries_processing/SimpleExponentialSmoothing.py
similarity index 100%
rename from timeseries_processing/SimpleExponentialSmoothing.py
rename to tods/timeseries_processing/SimpleExponentialSmoothing.py
diff --git a/timeseries_processing/TimeSeriesSeasonalityTrendDecomposition.py b/tods/timeseries_processing/TimeSeriesSeasonalityTrendDecomposition.py
similarity index 100%
rename from timeseries_processing/TimeSeriesSeasonalityTrendDecomposition.py
rename to tods/timeseries_processing/TimeSeriesSeasonalityTrendDecomposition.py
diff --git a/timeseries_processing/__init__.py b/tods/timeseries_processing/__init__.py
similarity index 100%
rename from timeseries_processing/__init__.py
rename to tods/timeseries_processing/__init__.py