
load_csv.ipynb 16 kB

first commit
4 years ago
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Axolotl CSV manipulation [Binary Classification]."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, we showcase different components of the system:\n",
"- Loading CSV data for a binary classification task.\n",
"- Easy use of the backend.\n",
"- Use of the simple search interface with a predefined search method.\n",
"- Exploring the searched pipelines."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import the utilities we will be using"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2020-07-12 15:23:25,435\tINFO resource_spec.py:212 -- Starting Ray with 4.39 GiB memory available for workers and up to 2.2 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).\n",
"2020-07-12 15:23:25,965\tINFO services.py:1170 -- View the Ray dashboard at localhost:8265\n"
]
}
],
"source": [
"import os\n",
"from pprint import pprint\n",
"import pandas as pd\n",
"from sklearn.datasets import make_regression\n",
"\n",
"from d3m import container\n",
"from d3m.metadata.pipeline import Pipeline\n",
"\n",
"from axolotl.utils import data_problem, pipeline as pipeline_utils\n",
"from axolotl.backend.ray import RayRunner\n",
"from axolotl.algorithms.random_search import RandomSearch\n",
"\n",
"# init the runner\n",
"backend = RayRunner(random_seed=42, volumes_dir=None, n_workers=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load a CSV file and transform it into a dataset"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"table_path = os.path.join('..', 'tests', 'data', 'datasets', 'iris_dataset_1', 'tables', 'learningData.csv')\n",
"df = pd.read_csv(table_path)\n",
"dataset, problem_description = data_problem.generate_dataset_problem(df, task='binary_classification', target_index=5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create an instance of the search and fit it with the input_data."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# search_fit searches for the best pipeline within the time budget, then fits the best pipeline (by rank) on the input_data.\n",
"search = RandomSearch(problem_description=problem_description, backend=backend)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Current trial is failed. Error: [StepFailedError('Step 7 for pipeline 47ec5c86-46b8-4dee-9562-1e5ebc3d0824 failed.',)]\n",
"Current trial is failed. Error: [StepFailedError('Step 7 for pipeline 64da5190-c2ee-4b8e-abef-697b54cfa32b failed.',)]\n",
"Current trial is failed. Error: [StepFailedError('Step 7 for pipeline 9e03188f-2120-49ac-a087-1e4fb1b29754 failed.',)]\n",
"Current trial is failed. Error: [StepFailedError('Step 7 for pipeline af32bc20-64fa-44a5-ab34-bbe810b671b1 failed.',)]\n",
"Current trial is failed. Error: [StepFailedError('Step 7 for pipeline 5dbc9e87-19be-4cda-ac51-c1d7ea9328c1 failed.',)]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(pid=85426) class_weight presets \"balanced\" or \"balanced_subsample\" are not recommended for warm_start if the fitted data differs from the full dataset. In order to use \"balanced\" weights, use compute_class_weight (\"balanced\", classes, y). In place of y you can use a large enough sample of the full training set target to properly estimate the class frequency distributions. Pass the resulting weights as the class_weight parameter.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Current trial is failed. Error: [StepFailedError('Step 7 for pipeline 918c088e-58dd-4991-8336-deb0b41cb5eb failed.',)]\n",
"Current trial is failed. Error: [StepFailedError('Step 7 for pipeline 41dfec8f-0b07-4f8e-8ff3-cdbb1dab11c7 failed.',)]\n",
"Current trial is failed. Error: [StepFailedError('Step 7 for pipeline d465a878-1ea5-4b72-b8a7-3a4122d1a482 failed.',)]\n",
"Current trial is failed. Error: [StepFailedError('Step 7 for pipeline 8c39e981-f446-4fde-8744-5606c35a7fdf failed.',)]\n",
"Current trial is failed. Error: [StepFailedError('Step 7 for pipeline df127bce-11af-4fae-b8bb-722cb0666484 failed.',)]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(pid=85426) class_weight presets \"balanced\" or \"balanced_subsample\" are not recommended for warm_start if the fitted data differs from the full dataset. In order to use \"balanced\" weights, use compute_class_weight (\"balanced\", classes, y). In place of y you can use a large enough sample of the full training set target to properly estimate the class frequency distributions. Pass the resulting weights as the class_weight parameter.\n",
"(pid=85426) The parameter 'presort' is deprecated and has no effect. It will be removed in v0.24. You can suppress this warning by not passing any value to the 'presort' parameter. We also recommend using HistGradientBoosting models instead.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Current trial is failed. Error: [StepFailedError('Step 7 for pipeline 0985e11e-8db0-4c1c-9f34-3ce8fbc626c1 failed.',)]\n",
"Current trial is failed. Error: [StepFailedError('Step 7 for pipeline 8977a9c0-dd79-4771-9dc1-455586b80947 failed.',)]\n",
"Current trial is failed. Error: [StepFailedError('Step 7 for pipeline c0238551-5fbb-41cd-8187-d3d23bc5571d failed.',)]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(pid=85426) class_weight presets \"balanced\" or \"balanced_subsample\" are not recommended for warm_start if the fitted data differs from the full dataset. In order to use \"balanced\" weights, use compute_class_weight (\"balanced\", classes, y). In place of y you can use a large enough sample of the full training set target to properly estimate the class frequency distributions. Pass the resulting weights as the class_weight parameter.\n"
]
}
],
"source": [
"fitted_pipeline, fitted_pipeline_result = search.search_fit(input_data=[dataset], time_limit=30)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"produce_results = search.produce(fitted_pipeline, [dataset])"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>d3mIndex</th>\n",
" <th>species</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>145</th>\n",
" <td>145</td>\n",
" <td>Iris-virginica</td>\n",
" </tr>\n",
" <tr>\n",
" <th>146</th>\n",
" <td>146</td>\n",
" <td>Iris-virginica</td>\n",
" </tr>\n",
" <tr>\n",
" <th>147</th>\n",
" <td>147</td>\n",
" <td>Iris-virginica</td>\n",
" </tr>\n",
" <tr>\n",
" <th>148</th>\n",
" <td>148</td>\n",
" <td>Iris-virginica</td>\n",
" </tr>\n",
" <tr>\n",
" <th>149</th>\n",
" <td>149</td>\n",
" <td>Iris-virginica</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>150 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" d3mIndex species\n",
"0 0 Iris-setosa\n",
"1 1 Iris-setosa\n",
"2 2 Iris-setosa\n",
"3 3 Iris-setosa\n",
"4 4 Iris-setosa\n",
".. ... ...\n",
"145 145 Iris-virginica\n",
"146 146 Iris-virginica\n",
"147 147 Iris-virginica\n",
"148 148 Iris-virginica\n",
"149 149 Iris-virginica\n",
"\n",
"[150 rows x 2 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"produce_results.output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Print information about the scores of the succeeded pipelines."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"----------------------------------------------------\n",
"Pipeline id: 676360d8-71ac-401c-b44a-31a810c4e8d3\n",
"Rank: 0.22667216466666668\n",
" metric value normalized randomSeed fold\n",
"0 ACCURACY 0.773333 0.773333 42 0\n",
"----------------------------------------------------\n",
"Pipeline id: 85d44359-0dac-4260-aea8-c78950025c3f\n",
"Rank: 0.33333446433333336\n",
" metric value normalized randomSeed fold\n",
"0 ACCURACY 0.666667 0.666667 42 0\n",
"----------------------------------------------------\n",
"Pipeline id: 3efb07be-28ff-45d8-b1fb-1c49f96b3381\n",
"Rank: 0.6666653826666668\n",
" metric value normalized randomSeed fold\n",
"0 ACCURACY 0.333333 0.333333 42 0\n",
"----------------------------------------------------\n",
"Pipeline id: abd9eb99-a4ba-4210-bb34-c2dec7c3ccfa\n",
"Rank: 0.6666606186666667\n",
" metric value normalized randomSeed fold\n",
"0 ACCURACY 0.333333 0.333333 42 0\n",
"----------------------------------------------------\n",
"Pipeline id: 8948a194-0dfe-4d07-a7c8-d1f5136f68c6\n",
"Rank: 0.21333939733333337\n",
" metric value normalized randomSeed fold\n",
"0 ACCURACY 0.786667 0.786667 42 0\n",
"----------------------------------------------------\n",
"Pipeline id: 22866f54-ba68-49e5-8f84-a2a6aba98253\n",
"Rank: 0.16000235200000004\n",
" metric value normalized randomSeed fold\n",
"0 ACCURACY 0.84 0.84 42 0\n",
"----------------------------------------------------\n",
"Pipeline id: 37a1c72a-9efd-4b0a-9d3d-811d47571b45\n",
"Rank: 0.6666753326666668\n",
" metric value normalized randomSeed fold\n",
"0 ACCURACY 0.333333 0.333333 42 0\n",
"----------------------------------------------------\n",
"Pipeline id: 2d3cae0f-66f6-46e0-9fa5-128bf02b4d7e\n",
"Rank: 0.6666655736666668\n",
" metric value normalized randomSeed fold\n",
"0 ACCURACY 0.333333 0.333333 42 0\n",
"----------------------------------------------------\n",
"Pipeline id: d1e5a59d-be50-42f3-a71b-cf8ba59b3c47\n",
"Rank: 0.08666869166666667\n",
" metric value normalized randomSeed fold\n",
"0 ACCURACY 0.913333 0.913333 42 0\n",
"----------------------------------------------------\n",
"Pipeline id: 35d47611-bded-4669-9803-9d259f686ec1\n",
"Rank: 0.35999672099999996\n",
" metric value normalized randomSeed fold\n",
"0 ACCURACY 0.64 0.64 42 0\n",
"----------------------------------------------------\n",
"Pipeline id: 7398d17f-e91f-4c75-9a95-c9f85763c858\n",
"Rank: 0.6666598006666667\n",
" metric value normalized randomSeed fold\n",
"0 ACCURACY 0.333333 0.333333 42 0\n",
"----------------------------------------------------\n",
"Pipeline id: 5293503b-4cb6-4b8b-bf8e-8b9d981c3b03\n",
"Rank: 0.04666429966666663\n",
" metric value normalized randomSeed fold\n",
"0 ACCURACY 0.953333 0.953333 42 0\n",
"----------------------------------------------------\n",
"Pipeline id: 756e2a15-3315-4aa1-8620-f73ffc69f8a4\n",
"Rank: 0.6666748276666667\n",
" metric value normalized randomSeed fold\n",
"0 ACCURACY 0.333333 0.333333 42 0\n",
"----------------------------------------------------\n",
"Pipeline id: 46633510-6f46-479e-982e-263aaa2e187a\n",
"Rank: 0.17999182400000005\n",
" metric value normalized randomSeed fold\n",
"0 ACCURACY 0.82 0.82 42 0\n",
"----------------------------------------------------\n",
"Pipeline id: 49a750b0-5c86-4ff3-9b2d-c58c6390dd0d\n",
"Rank: 0.6666588986666667\n",
" metric value normalized randomSeed fold\n",
"0 ACCURACY 0.333333 0.333333 42 0\n",
"----------------------------------------------------\n",
"Pipeline id: 84c24452-b2cf-41a2-813c-a135eaeef480\n",
"Rank: 0.36000324699999997\n",
" metric value normalized randomSeed fold\n",
"0 ACCURACY 0.64 0.64 42 0\n",
"----------------------------------------------------\n",
"Pipeline id: 82117b6b-6960-48bb-b1f4-91355acf51d6\n",
"Rank: 0.026667331666666617\n",
" metric value normalized randomSeed fold\n",
"0 ACCURACY 0.973333 0.973333 42 0\n"
]
}
],
"source": [
"for pipeline_result in search.history:\n",
"    print('-' * 52)\n",
"    print('Pipeline id:', pipeline_result.pipeline.id)\n",
"    print('Rank:', pipeline_result.rank)\n",
"    print(pipeline_result.scores)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

A full-stack automated machine learning system, focused on anomaly detection for multivariate time-series data. TODS provides comprehensive modules for building machine-learning-based anomaly detection systems, including: data processing, time series processing, feature analysis, detection algorithms, and a reinforcement module. The functionality these modules provide includes common data preprocessing, smoothing and transformation of time-series data, feature extraction from the time and frequency domains, and a wide variety of detection algorithms.
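To make the module pipeline above concrete, here is a minimal, self-contained sketch of the kind of detector such a system composes: time series processing (smoothing), feature analysis (residual extraction), and detection (thresholding). This is an illustrative example using only NumPy, not the TODS API; the function name and parameters are hypothetical.

```python
import numpy as np

def moving_average_residual_scores(series, window=5, threshold=3.0):
    """Score each point by its deviation from a trailing moving average.

    Returns (scores, flags): z-score-like residual magnitudes and a
    boolean anomaly mask where the score exceeds `threshold`.
    """
    series = np.asarray(series, dtype=float)
    # Time series processing: smooth with a centered moving average.
    kernel = np.ones(window) / window
    smoothed = np.convolve(series, kernel, mode="same")
    # Feature analysis: residual between the raw signal and its smoothed version.
    residual = series - smoothed
    # Detection: flag points whose residual is extreme relative to the rest.
    std = residual.std() or 1.0
    scores = np.abs(residual - residual.mean()) / std
    return scores, scores > threshold

# Usage: a sine wave with one injected spike; the spike should be flagged.
t = np.linspace(0, 4 * np.pi, 200)
signal = np.sin(t)
signal[100] += 5.0  # injected anomaly
scores, flags = moving_average_residual_scores(signal)
print(np.nonzero(flags)[0])
```

A real system would chain many such steps (and search over them automatically), but the data flow — preprocess, transform, extract features, score, threshold — is the same.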
