@@ -4,6 +4,8 @@
[![Build Status](https://travis-ci.org/datamllab/tods.svg?branch=master)](https://travis-ci.org/datamllab/tods)
[中文文档 (Chinese README)](README.zh-CN.md)
TODS is a full-stack automated machine learning system for outlier detection on multivariate time-series data. TODS provides comprehensive modules for building machine-learning-based outlier detection systems, including data processing, time series processing, feature analysis (extraction), detection algorithms, and a reinforcement module. Together these modules cover general-purpose data preprocessing, time series smoothing and transformation, feature extraction from the time and frequency domains, a variety of detection algorithms, and the incorporation of human expertise to calibrate the system. TODS supports three common outlier detection scenarios on time-series data: point-wise detection (time points as outliers), pattern-wise detection (subsequences as outliers), and system-wise detection (sets of time series as outliers), and provides a wide range of corresponding algorithms. This package is developed by [DATA Lab @ Texas A&M University](https://people.engr.tamu.edu/xiahu/index.html).
TODS features:
@@ -16,18 +18,21 @@ TODS features:
## Resources
* API documentation: [http://tods-doc.github.io](http://tods-doc.github.io)
* Paper: [https://arxiv.org/abs/2009.09822](https://arxiv.org/abs/2009.09822)
* Related project: [AutoVideo: An Automated Video Action Recognition System](https://github.com/datamllab/autovideo)
## Cite this Work:
If you find this work useful, you may cite it as follows:
```
@misc{lai2020tods,
  title={TODS: An Automated Time Series Outlier Detection System},
  author={Kwei-Herng Lai and Daochen Zha and Guanchu Wang and Junjie Xu and Yue Zhao and Devesh Kumar and Yile Chen and Purav Zumkhawaka and Minyang Wan and Diego Martinez and Xia Hu},
  year={2020},
  eprint={2009.09822},
  archivePrefix={arXiv},
  primaryClass={cs.DB}
@article{Lai_Zha_Wang_Xu_Zhao_Kumar_Chen_Zumkhawaka_Wan_Martinez_Hu_2021,
  title={TODS: An Automated Time Series Outlier Detection System},
  volume={35},
  number={18},
  journal={Proceedings of the AAAI Conference on Artificial Intelligence},
  author={Lai, Kwei-Herng and Zha, Daochen and Wang, Guanchu and Xu, Junjie and Zhao, Yue and Kumar, Devesh and Chen, Yile and Zumkhawaka, Purav and Wan, Minyang and Martinez, Diego and Hu, Xia},
  year={2021}, month={May},
  pages={16060-16062}
}
```
## Installation
@@ -37,7 +42,7 @@ This package works with **Python 3.6** and pip 19+. You need to have the following packages installed on the system (for Debian and Ubuntu):
```
sudo apt-get install libssl-dev libcurl4-openssl-dev libyaml-dev build-essential libopenblas-dev libcap-dev ffmpeg
```
Clone the repository:
Clone the repository (if you are in China and GitHub is slow, you can use the mirror on [Gitee](https://gitee.com/daochenzha/tods)):
```
git clone https://github.com/datamllab/tods.git
```
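Then install the package locally with `pip`:
```
cd tods
pip install -e .
```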
@@ -0,0 +1,113 @@
# TODS: Automated Time-series Outlier Detection System
<img width="500" src="./docs/img/tods_logo.png" alt="Logo" />
[![Build Status](https://travis-ci.org/datamllab/tods.svg?branch=master)](https://travis-ci.org/datamllab/tods)
[English README](README.md)
TODS is a full-stack automated machine learning system for outlier detection on multivariate time-series data. TODS provides comprehensive modules for building machine-learning-based outlier detection systems: data processing, time series processing, feature analysis, detection algorithms, and a reinforcement module. Together these modules cover common data preprocessing, time series smoothing and transformation, feature extraction from the time and frequency domains, a variety of detection algorithms, and calibration of the system by human experts. The system handles three common outlier detection scenarios on time-series data: point-wise detection (time points as outliers), pattern-wise detection (subsequences as outliers), and system-wise detection (sets of time series as outliers), and TODS provides a family of corresponding algorithms. The package is developed by [DATA Lab @ Texas A&M University](https://people.engr.tamu.edu/xiahu/index.html).
TODS features:
* **Full-stack machine learning system**: supports every step from data preprocessing and feature extraction to detection algorithms and human-defined rules, with corresponding interfaces for each.
* **Broad algorithm support**: includes point-wise outlier detection algorithms from [PyOD](https://github.com/yzhao062/pyod), state-of-the-art pattern-wise detection algorithms (e.g. [DeepLog](https://www.cs.utah.edu/~lifeifei/papers/deeplog.pdf), [Telemanom](https://arxiv.org/pdf/1802.04431.pdf)), and ensemble algorithms for system-wise detection.
* **Automated machine learning**: aims at a knowledge-free process that constructs the optimal pipeline for the given data by automatically searching for the best combination across all existing modules.
## Resources
* API documentation: [http://tods-doc.github.io](http://tods-doc.github.io)
* Paper: [https://arxiv.org/abs/2009.09822](https://arxiv.org/abs/2009.09822)
* Related project: [AutoVideo: An Automated Video Action Recognition System](https://github.com/datamllab/autovideo)
## Cite this Work:
If you find this work useful, you may cite it as follows:
```
@misc{lai2020tods,
  title={TODS: An Automated Time Series Outlier Detection System},
  author={Kwei-Herng Lai and Daochen Zha and Guanchu Wang and Junjie Xu and Yue Zhao and Devesh Kumar and Yile Chen and Purav Zumkhawaka and Minyang Wan and Diego Martinez and Xia Hu},
  year={2020},
  eprint={2009.09822},
  archivePrefix={arXiv},
  primaryClass={cs.DB}
}
```
## Installation
This package works with **Python 3.6** and pip 19+. Debian and Ubuntu users need to install the following system packages first:
```
sudo apt-get install libssl-dev libcurl4-openssl-dev libyaml-dev build-essential libopenblas-dev libcap-dev ffmpeg
```
Clone the repository (if GitHub is slow for you, users in mainland China can use the [Gitee mirror](https://gitee.com/daochenzha/tods)):
```
git clone https://github.com/datamllab/tods.git
```
Install locally with `pip`:
```
cd tods
pip install -e .
```
# Examples
Examples are available in [/examples](examples/). For basic usage, you can evaluate a pipeline on a given dataset. The example below shows how to load the default pipeline and evaluate it on a subset of the Yahoo dataset.
```python
import pandas as pd

from tods import schemas as schemas_utils
from tods import generate_dataset, evaluate_pipeline

table_path = 'datasets/anomaly/raw_data/yahoo_sub_5.csv'
target_index = 6  # which column is the target
metric = 'F1_MACRO'  # macro-averaged F1 over labels 0 and 1

# Read data and generate dataset
df = pd.read_csv(table_path)
dataset = generate_dataset(df, target_index)

# Load the default pipeline
pipeline = schemas_utils.load_default_pipeline()

# Run the pipeline
pipeline_result = evaluate_pipeline(dataset, pipeline, metric)
print(pipeline_result)
```
We also provide AutoML support to automatically find the pipeline best suited to your data.
```python
import pandas as pd

from axolotl.backend.simple import SimpleRunner

from tods import generate_dataset, generate_problem
from tods.searcher import BruteForceSearch

# Some information
table_path = 'datasets/yahoo_sub_5.csv'
target_index = 6  # which column is the target
time_limit = 30  # how many seconds to search
metric = 'F1_MACRO'  # macro-averaged F1 over labels 0 and 1

# Read data and generate dataset and problem
df = pd.read_csv(table_path)
dataset = generate_dataset(df, target_index=target_index)
problem_description = generate_problem(dataset, metric)

# Start backend
backend = SimpleRunner(random_seed=0)

# Start search algorithm
search = BruteForceSearch(problem_description=problem_description,
                          backend=backend)

# Find the best pipeline
best_runtime, best_pipeline_result = search.search_fit(input_data=[dataset], time_limit=time_limit)
best_pipeline = best_runtime.pipeline
best_output = best_pipeline_result.output

# Evaluate the best pipeline
best_scores = search.evaluate(best_pipeline).scores
```
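The discovered pipeline can also be re-scored with the stand-alone `evaluate_pipeline` helper from the first example; a minimal sketch, reusing the `dataset` and `metric` objects defined above:
```python
# Optional cross-check: evaluate the best pipeline found by the search.
from tods import evaluate_pipeline

best_pipeline_result = evaluate_pipeline(dataset, best_pipeline, metric)
print(best_pipeline_result.scores)
```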
# Acknowledgements
We gratefully acknowledge DARPA's Data Driven Discovery of Models (D3M) program.
@@ -13,8 +13,8 @@ parser.add_argument('--table_path', type=str, default=default_data_path,
                    help='Input the path of the input data table')
parser.add_argument('--target_index', type=int, default=6,
                    help='Index of the ground truth (for evaluation)')
parser.add_argument('--metric', type=str, default='F1_MACRO',
                    help='Evaluation Metric (F1, F1_MACRO)')
parser.add_argument('--metric', type=str, default='ALL',
                    help='Evaluation Metric (F1, F1_MACRO, RECALL, PRECISION, ALL)')
parser.add_argument('--pipeline_path',
                    default=os.path.join(this_path, './example_pipelines/autoencoder_pipeline.json'),
                    help='Input the path of the pre-built pipeline description')
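With this change, the example script reports all metrics by default. A hypothetical invocation (the script name `run_pipeline.py` is an assumption; the flags come from the argparse definitions above):
```
python run_pipeline.py --table_path datasets/anomaly/raw_data/yahoo_sub_5.csv --metric ALL
```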
@@ -35,6 +35,6 @@ pipeline = load_pipeline(pipeline_path)
# Run the pipeline
pipeline_result = evaluate_pipeline(dataset, pipeline, metric)
print(pipeline_result)
print(pipeline_result.scores)
#raise pipeline_result.error[0]
@@ -9,7 +9,7 @@ from tods.searcher import BruteForceSearch
#table_path = 'datasets/NAB/realTweets/labeled_Twitter_volume_GOOG.csv'  # The path of the dataset
#target_index = 2  # which column is the target
table_path = 'datasets/yahoo_sub_5.csv'
table_path = '../../datasets/anomaly/raw_data/yahoo_sub_5.csv'
target_index = 6  # which column is the target
#table_path = 'datasets/NAB/realTweets/labeled_Twitter_volume_IBM.csv'  # The path of the dataset
time_limit = 30  # how many seconds to search
@@ -1,9 +1,8 @@
#!/bin/bash
#modules="data_processing timeseries_processing feature_analysis detection_algorithms reinforcement"
modules="data_processing timeseries_processing feature_analysis detection_algorithm reinforcement"
#modules="data_processing timeseries_processing"
modules="detection_algorithm"
#test_scripts=$(ls primitive_tests | grep -v -f tested_file.txt)
#modules="detection_algorithm"
for module in $modules
do
@@ -1,9 +0,0 @@
# !/bin/bash
files=$(ls primitive_tests)
for f in $files
do
    f_path="./primitive_tests/"$f
    save_path="./new_tests/"$f
    cat $f_path | sed 's/d3m.primitives.data_transformation.dataset_to_dataframe.Common/d3m.primitives.tods.data_processing.dataset_to_dataframe/g' | sed 's/d3m.primitives.data_transformation.column_parser.Common/d3m.primitives.tods.data_processing.column_parser/g' | sed 's/d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common/d3m.primitives.tods.data_processing.extract_columns_by_semantic_types/g' | sed 's/d3m.primitives.data_transformation.construct_predictions.Common/d3m.primitives.tods.data_processing.construct_predictions/g' > $save_path
done
@@ -35,13 +35,14 @@ setup(
        ]
    },
    install_requires=[
        'tamu_d3m',
        'tamu_axolotl',
        'Jinja2',
        #'tamu_d3m',
        #'tamu_axolotl',
        #'Jinja2',
        'numpy==1.18.2',
        'combo',
        'simplejson==3.12.0',
        'scikit-learn==0.22.0',
        #'scikit-learn==0.22.0',
        'scikit-learn',
        'statsmodels==0.11.1',
        'PyWavelets>=1.1.1',
        'pillow==7.1.2',
@@ -7,10 +7,10 @@ import sklearn
import numpy
import typing
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.regularizers import l2
from keras.losses import mean_squared_error
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM
from tensorflow.keras.regularizers import l2
from tensorflow.keras.losses import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.utils import check_array
from sklearn.utils.validation import check_is_fitted
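These hunks migrate standalone `keras` imports to the `tensorflow.keras` namespace. A minimal sketch (assuming TensorFlow 2.x is installed; the layer sizes are illustrative, not taken from the diff) showing the migrated imports in use:
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM

# Tiny LSTM regressor: 10 time steps, 1 feature per step.
model = Sequential([
    LSTM(32, input_shape=(10, 1)),
    Dropout(0.2),
    Dense(1),
])
model.compile(optimizer='adam', loss='mean_squared_error')
model.summary()
```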
@@ -196,7 +196,7 @@ class LSTMODetectorPrimitive(UnsupervisedOutlierDetectorBase[Inputs, Outputs, Params, Hyperparams]):
    "python_path": "d3m.primitives.tods.detection_algorithm.LSTMODetector",
    "source": {'name': "DATALAB @ Texas A&M University", 'contact': 'mailto:khlai037@tamu.edu',
               'uris': ['https://gitlab.com/lhenry15/tods.git', 'https://gitlab.com/lhenry15/tods/-/blob/Junjie/anomaly-primitives/anomaly_primitives/LSTMOD.py']},
    "algorithm_types": [metadata_base.PrimitiveAlgorithmType.ISOLATION_FOREST, ],  # to be updated
    "algorithm_types": [metadata_base.PrimitiveAlgorithmType.TODS_PRIMITIVE, ],  # to be updated
    "primitive_family": metadata_base.PrimitiveFamily.ANOMALY_DETECTION,
    "version": "0.0.1",
    "hyperparams_to_tune": ['contamination', 'train_contamination', 'min_attack_time',
@@ -160,7 +160,7 @@ class Hyperparams(Hyperparams_ODBase):
    contamination = hyperparams.Uniform(
        lower=0.,
        upper=0.5,
        default=0.1,
        default=0.01,
        description='The amount of contamination of the data set, i.e. the proportion of outliers in the data set.',
        semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter']
    )
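For context on the new `default=0.01`: in PyOD-style detectors, contamination fixes the decision threshold at the score percentile that flags roughly that fraction of training points as outliers. A generic illustration (a sketch of the convention, not TODS internals):
```python
import numpy as np

# Anomaly scores from some detector; higher means more anomalous.
rng = np.random.default_rng(0)
scores = rng.normal(size=1000)

contamination = 0.01  # flag roughly 1% of points
threshold = np.percentile(scores, 100 * (1 - contamination))
labels = (scores > threshold).astype(int)
print(labels.sum())  # ~10 of 1000 points flagged as outliers
```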
@@ -9,11 +9,11 @@ import typing
import pandas as pd
from keras.models import Sequential, load_model
from keras.callbacks import History, EarlyStopping, Callback
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense, Activation, Dropout
from keras.layers import Flatten
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.callbacks import History, EarlyStopping, Callback
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.layers import Flatten
from d3m import container, utils
from d3m.base import utils as base_ut
@@ -11,8 +11,8 @@ from .CollectiveBase import CollectiveBaseDetector
# from tod.utility import get_sub_matrices
from keras.layers import Dense, LSTM
from keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.models import Sequential

class LSTMOutlierDetector(CollectiveBaseDetector):
@@ -1,8 +1,8 @@
from keras.models import Sequential, load_model
from keras.callbacks import History, EarlyStopping, Callback
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense, Activation, Dropout
from keras.layers import Flatten
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.callbacks import History, EarlyStopping, Callback
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.layers import Flatten
import numpy as np
import os
import logging
@@ -115,8 +115,8 @@ class RuleBasedFilter(transformer.TransformerPrimitiveBase[Inputs, Outputs, Hyperparams]):
    "python_path": "d3m.primitives.tods.reinforcement.rule_filter",
    "source": {'name': 'DATA Lab at Texas A&M University', 'contact': 'mailto:khlai037@tamu.edu',
               'uris': ['https://gitlab.com/lhenry15/tods.git', ]},
    "algorithm_types": [metadata_base.PrimitiveAlgorithmType.RULE_BASED_FILTER, ],
    "primitive_family": metadata_base.PrimitiveFamily.REINFORCEMENT,
    "algorithm_types": [metadata_base.PrimitiveAlgorithmType.TODS_PRIMITIVE, ],
    "primitive_family": metadata_base.PrimitiveFamily.ANOMALY_DETECTION,
    "id": "42744c37-8879-4785-9f18-6de9d612ea93",
    "hyperparams_to_tune": ['rule', ],
    "version": "0.0.1",
@@ -291,4 +291,4 @@ def _generate_pipelines(primitive_python_paths, cpu_count=40):  # pragma: no cover
    #for p in results:
    #    piplines.extend(p.get())
    return piplines
@@ -4,11 +4,11 @@ import sys
import unittest

runner = unittest.TextTestRunner(verbosity=1)
tests = unittest.TestLoader().discover('./')
if not runner.run(tests).wasSuccessful():
    sys.exit(1)
#tests = unittest.TestLoader().discover('./')
#if not runner.run(tests).wasSuccessful():
#    sys.exit(1)
#for each in ['data_processing', 'timeseries_processing', 'feature_analysis', 'detection_algorithm']:
#    tests = unittest.TestLoader().discover(each)
#    if not runner.run(tests).wasSuccessful():
#        sys.exit(1)
for each in ['data_processing', 'timeseries_processing', 'feature_analysis', 'detection_algorithm']:
    tests = unittest.TestLoader().discover(each)
    if not runner.run(tests).wasSuccessful():
        sys.exit(1)
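The runner now discovers and executes tests module by module. A minimal sketch (assuming the same directory layout) of running a single module's tests in isolation:
```python
# Run only the detection_algorithm tests; directory name taken from the
# module list above.
import sys
import unittest

runner = unittest.TextTestRunner(verbosity=1)
tests = unittest.TestLoader().discover('detection_algorithm')
if not runner.run(tests).wasSuccessful():
    sys.exit(1)
```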