@@ -1,89 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Python & Machine Learning Exercises"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Python\n",
"\n",
"### (1) Strings\n",
"Given a passage, count how many times each word occurs.\n",
"\n",
"```\n",
"One is always on a strange road, watching strange scenery and listening to strange music. Then one day, you will find that the things you try hard to forget are already gone. \n",
"```\n",
"\n",
"### (2) Combinations\n",
"Using the digits 1, 2, 3 and 4, how many distinct three-digit numbers with no repeated digits can be formed, and what are they?\n",
"\n",
"\n",
"### (3) Conditionals\n",
"A company pays bonuses as a commission on profit. For a profit I:\n",
"* up to and including 100,000 yuan, the bonus is 10% of the profit;\n",
"* between 100,000 and 200,000 yuan, the first 100,000 earns 10% and the part above 100,000 earns 7.5%;\n",
"* between 200,000 and 400,000 yuan, the part above 200,000 earns 5%;\n",
"* between 400,000 and 600,000 yuan, the part above 400,000 earns 3%;\n",
"* between 600,000 and 1,000,000 yuan, the part above 600,000 earns 1.5%;\n",
"* above 1,000,000 yuan, the part above 1,000,000 earns 1%.\n",
"\n",
"Read the month's profit I from the keyboard and compute the total bonus to pay.\n",
"\n",
"\n",
"### (4) Loops\n",
"Print the 9x9 multiplication table.\n",
"\n",
"\n",
"### (5) Use a while loop to compute the sum 2 - 3 + 4 - 5 + 6 - ... + 100\n",
"\n",
"\n",
"### (6) Algorithms\n",
"Given a list of numbers, sort it in descending order.\n",
"\n",
"For example\n",
"```\n",
"1, 10, 4, 2, 9, 2, 34, 5, 9, 8, 5, 0\n",
"```\n",
"\n",
"### (7) Application 1\n",
"As an independent developer on the Apple App Store, you are running a limited-time promotion and need to generate activation codes (or coupons) for your app. How would you generate 200 activation codes (or coupons) with Python?\n",
"\n",
"Think about what an activation code is and what properties it needs; for example, `KR603guyVvR` is an activation code.\n",
"\n",
"### (8) Application 2\n",
"Find every file of a given type under a directory,\n",
"for example all the `.dll` files under `c:`.\n",
"\n",
"### (9) Application 3\n",
"You have a directory of programs (say, C or Python). Count how many lines of code you have written, including blank lines and comments, but report each category separately.\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -1,74 +0,0 @@ | |||
# -*- coding: utf-8 -*- | |||
# --- | |||
# jupyter: | |||
# jupytext_format_version: '1.2' | |||
# kernelspec: | |||
# display_name: Python 3 | |||
# language: python | |||
# name: python3 | |||
# language_info: | |||
# codemirror_mode: | |||
# name: ipython | |||
# version: 3 | |||
# file_extension: .py | |||
# mimetype: text/x-python | |||
# name: python | |||
# nbconvert_exporter: python | |||
# pygments_lexer: ipython3 | |||
# version: 3.5.2 | |||
# --- | |||
# # Python & Machine Learning Exercises
# ## Python
#
# ### (1) Strings
# Given a passage, count how many times each word occurs.
#
# ```
# One is always on a strange road, watching strange scenery and listening to strange music. Then one day, you will find that the things you try hard to forget are already gone.
# ```
#
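One possible sketch for exercise (1), using `collections.Counter` with the sample sentence hard-coded:

```python
import re
from collections import Counter

text = ("One is always on a strange road, watching strange scenery and "
        "listening to strange music. Then one day, you will find that the "
        "things you try hard to forget are already gone.")

# lowercase first, then extract alphabetic runs so punctuation is ignored
words = re.findall(r"[a-z]+", text.lower())
counts = Counter(words)
print(counts.most_common(3))
```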
# ### (2) Combinations
# Using the digits 1, 2, 3 and 4, how many distinct three-digit numbers with no repeated digits can be formed, and what are they?
#
#
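Exercise (2) is a straight permutation count; `itertools.permutations` enumerates them directly:

```python
from itertools import permutations

# every ordered pick of 3 distinct digits from 1-4 gives one valid number
numbers = [100 * a + 10 * b + c for a, b, c in permutations([1, 2, 3, 4], 3)]
print(len(numbers))   # 4 * 3 * 2 = 24
print(numbers)
```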
# ### (3) Conditionals
# A company pays bonuses as a commission on profit. For a profit I:
# * up to and including 100,000 yuan, the bonus is 10% of the profit;
# * between 100,000 and 200,000 yuan, the first 100,000 earns 10% and the part above 100,000 earns 7.5%;
# * between 200,000 and 400,000 yuan, the part above 200,000 earns 5%;
# * between 400,000 and 600,000 yuan, the part above 400,000 earns 3%;
# * between 600,000 and 1,000,000 yuan, the part above 600,000 earns 1.5%;
# * above 1,000,000 yuan, the part above 1,000,000 earns 1%.
#
# Read the month's profit I from the keyboard and compute the total bonus to pay.
#
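The brackets above are marginal, like tax brackets. One possible sketch (the helper name `bonus` is ours; swap the literal profit for `int(input())` to read from the keyboard):

```python
def bonus(profit):
    """Total bonus for a monthly profit, applying each rate only to the
    part of the profit that falls inside that bracket (units: yuan)."""
    brackets = [            # (upper bound of bracket, rate inside it)
        (100000, 0.10),
        (200000, 0.075),
        (400000, 0.05),
        (600000, 0.03),
        (1000000, 0.015),
        (float("inf"), 0.01),
    ]
    total, lower = 0.0, 0
    for upper, rate in brackets:
        if profit > lower:
            total += (min(profit, upper) - lower) * rate
        lower = upper
    return total

print(bonus(150000))
```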
# ### (4) Loops
# Print the 9x9 multiplication table.
#
#
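A minimal nested-loop sketch for exercise (4), printing the triangular form of the table:

```python
# row i shows the products j*i for j = 1..i
lines = []
for i in range(1, 10):
    lines.append(" ".join("%dx%d=%2d" % (j, i, i * j) for j in range(1, i + 1)))
print("\n".join(lines))
```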
# ### (5) Use a while loop to compute the sum 2 - 3 + 4 - 5 + 6 - ... + 100
#
#
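For exercise (5): even terms are added and odd terms subtracted, so a `while` loop only needs a sign test:

```python
total = 0
n = 2
while n <= 100:
    total += n if n % 2 == 0 else -n   # +2, -3, +4, -5, ..., +100
    n += 1
print(total)   # 49 pairs of (even - odd) = -49, plus the final +100 -> 51
```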
# ### (6) Algorithms
# Given a list of numbers, sort it in descending order.
#
# For example
# ```
# 1, 10, 4, 2, 9, 2, 34, 5, 9, 8, 5, 0
# ```
#
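One possible sketch for exercise (6): a hand-rolled selection sort on the sample list (the built-in answer is simply `sorted(data, reverse=True)`):

```python
data = [1, 10, 4, 2, 9, 2, 34, 5, 9, 8, 5, 0]

for i in range(len(data)):
    # find the index of the largest remaining element and swap it into slot i
    k = max(range(i, len(data)), key=lambda j: data[j])
    data[i], data[k] = data[k], data[i]

print(data)
```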
# ### (7) Application 1
# As an independent developer on the Apple App Store, you are running a limited-time promotion and need to generate activation codes (or coupons) for your app. How would you generate 200 activation codes (or coupons) with Python?
#
# Think about what an activation code is and what properties it needs; for example, `KR603guyVvR` is an activation code.
#
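For exercise (7), the sample code `KR603guyVvR` suggests 11 alphanumeric characters; the key property is uniqueness, which a `set` guarantees. A possible sketch (the helper name `make_codes` is ours; for real coupons prefer the `secrets` module over `random`):

```python
import random
import string

def make_codes(n, length=11):
    """Generate n unique alphanumeric activation codes like 'KR603guyVvR'."""
    alphabet = string.ascii_letters + string.digits
    codes = set()
    while len(codes) < n:      # the set silently drops any duplicate draw
        codes.add("".join(random.choice(alphabet) for _ in range(length)))
    return sorted(codes)

codes = make_codes(200)
print(codes[:3])
```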
# ### (8) Application 2
# Find every file of a given type under a directory,
# for example all the `.dll` files under `c:`.
#
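Exercise (8) is a recursive directory walk; `os.walk` does the recursion. A possible sketch (the helper name `find_files` is ours):

```python
import os

def find_files(root, ext):
    """Recursively collect paths under root whose file name ends with ext."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(ext.lower()):   # case-insensitive match
                matches.append(os.path.join(dirpath, name))
    return matches

# e.g. find_files("c:\\", ".dll") on Windows; here, .py files under the cwd
print(len(find_files(".", ".py")))
```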
# ### (9) Application 3
# You have a directory of programs (say, C or Python). Count how many lines of code you have written, including blank lines and comments, but report each category separately.
#
#
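A per-file sketch for exercise (9), classifying each line as code, blank, or `#` comment (the helper name `count_lines` is ours; the demo writes a small sample file so it runs anywhere):

```python
import os
import tempfile

def count_lines(path):
    """Return (code, blank, comment) line counts for one Python source file."""
    code = blank = comment = 0
    with open(path) as f:
        for line in f:
            stripped = line.strip()
            if not stripped:
                blank += 1
            elif stripped.startswith("#"):
                comment += 1
            else:
                code += 1
    return code, blank, comment

# demo on a small generated file: 1 comment, 1 blank, 2 code lines
path = os.path.join(tempfile.mkdtemp(), "sample.py")
with open(path, "w") as f:
    f.write("# a comment\n\nx = 1\nprint(x)\n")
print(count_lines(path))
```

To cover a whole directory, combine this with `os.walk` and sum the tuples; note that docstrings and block comments would need real tokenizing (e.g. the `tokenize` module) to classify correctly.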
@@ -1,82 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Numerical Computing\n",
"\n",
"\n",
"### (1) Given an existing array, how do you add a border filled with zeros around it?\n",
"For example, turn the 2-D matrix\n",
"```\n",
"10, 34, 54, 23\n",
"31, 87, 53, 68\n",
"98, 49, 25, 11\n",
"84, 32, 67, 88\n",
"```\n",
"\n",
"into\n",
"```\n",
" 0,  0,  0,  0,  0,  0\n",
" 0, 10, 34, 54, 23,  0\n",
" 0, 31, 87, 53, 68,  0\n",
" 0, 98, 49, 25, 11,  0\n",
" 0, 84, 32, 67, 88,  0\n",
" 0,  0,  0,  0,  0,  0\n",
"```\n",
"\n",
"### (2) Create a 5x5 matrix with the values 1, 2, 3, 4 just below the main diagonal\n",
"\n",
"\n",
"### (3) Create an 8x8 matrix with a chessboard pattern (0 for black, 1 for white)\n",
"\n",
"\n",
"### (4) Solving a system of linear equations\n",
"\n",
"Given a system of equations, how do you find its solution? Several methods exist; analyse the pros and cons of each (the simplest is elimination).\n",
"\n",
"For example\n",
"```\n",
"3x + 4y + 2z = 10\n",
"5x + 3y + 4z = 14\n",
"8x + 2y + 7z = 20\n",
"```\n",
"\n",
"Write a program that solves the system.\n",
"\n",
"\n",
"### (5) Reverse an array (the first element becomes the last)\n",
"\n",
"\n",
"### (6) Generate a 10x10 random array and find its maximum and minimum values\n",
"\n",
"\n",
"## Reference\n",
"* [100 numpy exercises](https://github.com/rougier/numpy-100)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
},
"main_language": "python"
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -1,70 +0,0 @@ | |||
# -*- coding: utf-8 -*- | |||
# --- | |||
# jupyter: | |||
# jupytext_format_version: '1.2' | |||
# kernelspec: | |||
# display_name: Python 3 | |||
# language: python | |||
# name: python3 | |||
# language_info: | |||
# codemirror_mode: | |||
# name: ipython | |||
# version: 3 | |||
# file_extension: .py | |||
# mimetype: text/x-python | |||
# name: python | |||
# nbconvert_exporter: python | |||
# pygments_lexer: ipython3 | |||
# version: 3.5.2 | |||
# --- | |||
# ## Numerical Computing
#
#
# ### (1) Given an existing array, how do you add a border filled with zeros around it?
# For example, turn the 2-D matrix
# ```
# 10, 34, 54, 23
# 31, 87, 53, 68
# 98, 49, 25, 11
# 84, 32, 67, 88
# ```
#
# into
# ```
#  0,  0,  0,  0,  0,  0
#  0, 10, 34, 54, 23,  0
#  0, 31, 87, 53, 68,  0
#  0, 98, 49, 25, 11,  0
#  0, 84, 32, 67, 88,  0
#  0,  0,  0,  0,  0,  0
# ```
#
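One possible answer for (1): `np.pad` adds the zero border in a single call:

```python
import numpy as np

a = np.array([[10, 34, 54, 23],
              [31, 87, 53, 68],
              [98, 49, 25, 11],
              [84, 32, 67, 88]])

# pad one row/column of the constant 0 on every side
b = np.pad(a, pad_width=1, mode="constant", constant_values=0)
print(b)
```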
# ### (2) Create a 5x5 matrix with the values 1, 2, 3, 4 just below the main diagonal
#
#
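For (2), `np.diag` with `k=-1` places the values on the first sub-diagonal:

```python
import numpy as np

# a 4-element vector on diagonal k=-1 yields a 5x5 matrix
m = np.diag([1, 2, 3, 4], k=-1)
print(m)
```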
# ### (3) Create an 8x8 matrix with a chessboard pattern (0 for black, 1 for white)
#
#
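For (3), two strided slice assignments produce the chessboard:

```python
import numpy as np

board = np.zeros((8, 8), dtype=int)
board[1::2, ::2] = 1   # odd rows, even columns
board[::2, 1::2] = 1   # even rows, odd columns
print(board)
```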
# ### (4) Solving a system of linear equations
#
# Given a system of equations, how do you find its solution? Several methods exist; analyse the pros and cons of each (the simplest is elimination).
#
# For example
# ```
# 3x + 4y + 2z = 10
# 5x + 3y + 4z = 14
# 8x + 2y + 7z = 20
# ```
#
# Write a program that solves the system.
#
#
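Besides hand-rolled elimination, NumPy's direct solver (LU-based `np.linalg.solve`) handles the example system in one call:

```python
import numpy as np

# coefficients of 3x+4y+2z=10, 5x+3y+4z=14, 8x+2y+7z=20
A = np.array([[3.0, 4.0, 2.0],
              [5.0, 3.0, 4.0],
              [8.0, 2.0, 7.0]])
b = np.array([10.0, 14.0, 20.0])

x = np.linalg.solve(A, b)
print(x)
```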
# ### (5) Reverse an array (the first element becomes the last)
#
#
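For (5), a negative-step slice reverses without copying:

```python
import numpy as np

a = np.arange(10)
reversed_a = a[::-1]   # a view of a in reverse order, not a copy
print(reversed_a)
```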
# ### (6) Generate a 10x10 random array and find its maximum and minimum values
#
#
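For (6), the array methods `min()` and `max()` reduce over all elements:

```python
import numpy as np

z = np.random.random((10, 10))   # uniform values in [0, 1)
print(z.min(), z.max())
```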
# ## Reference
# * [100 numpy exercises](https://github.com/rougier/numpy-100)
@@ -1,37 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Matplotlib\n",
"\n",
"\n",
"## (1) Plot a quadratic function, together with the trapezoids used when integrating it with the trapezoidal rule\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
},
"main_language": "python"
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -1,26 +0,0 @@ | |||
# -*- coding: utf-8 -*- | |||
# --- | |||
# jupyter: | |||
# jupytext_format_version: '1.2' | |||
# kernelspec: | |||
# display_name: Python 3 | |||
# language: python | |||
# name: python3 | |||
# language_info: | |||
# codemirror_mode: | |||
# name: ipython | |||
# version: 3 | |||
# file_extension: .py | |||
# mimetype: text/x-python | |||
# name: python | |||
# nbconvert_exporter: python | |||
# pygments_lexer: ipython3 | |||
# version: 3.5.2 | |||
# --- | |||
# # Matplotlib
#
#
# ## (1) Plot a quadratic function, together with the trapezoids used when integrating it with the trapezoidal rule
#
#
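One possible sketch for (1): sample f(x) = x² at the trapezoid edges, fill each trapezoid, and compare the summed areas against the exact integral (the `Agg` backend is used so the script also runs headless):

```python
import matplotlib
matplotlib.use("Agg")          # draw off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

f = lambda x: x ** 2           # the quadratic to integrate
a, b, n = 0.0, 2.0, 8          # interval [a, b] split into n trapezoids
xs = np.linspace(a, b, n + 1)
ys = f(xs)

# the smooth curve
x_fine = np.linspace(a, b, 200)
plt.plot(x_fine, f(x_fine), "b-")

# each trapezoid: the chord of f over [x0, x1], dropped to the x-axis
for x0, x1, y0, y1 in zip(xs[:-1], xs[1:], ys[:-1], ys[1:]):
    plt.fill([x0, x0, x1, x1], [0, y0, y1, 0],
             facecolor="red", edgecolor="black", alpha=0.2)
plt.title("Trapezoidal rule for $x^2$ on [0, 2]")
plt.savefig("trapezoid.png")

approx = np.trapz(ys, xs)      # same value as summing the trapezoid areas
print(approx)                  # exact integral is 8/3
```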
@@ -1,169 +0,0 @@
# -*- coding: utf-8 -*-
# ---
# jupyter:
#   jupytext_format_version: '1.2'
#   kernelspec:
#     display_name: Python 3
#     language: python
#     name: python3
#   language_info:
#     codemirror_mode:
#       name: ipython
#       version: 3
#     file_extension: .py
#     mimetype: text/x-python
#     name: python
#     nbconvert_exporter: python
#     pygments_lexer: ipython3
#     version: 3.5.2
# ---
# # Exercise - Traffic Accident Claim Approval Prediction
#
#
# Competition link: http://sofasofa.io/competition.php?id=2
#
#
# * Task type: binary classification
#
# * Background: after a traffic accident, a claims adjuster visits the scene to collect information, and that information largely determines whether the car owner is compensated by the insurance company. The training data contain the 36 (already encoded) fields the adjuster collected about each party, together with whether that party was ultimately compensated. Our task is to predict, from those 36 fields, the probability that a party is not compensated.
#
# * Data: the training set has 200,000 samples; the prediction set has 80,000 samples.
# ![data](images/data.png)
#
# * Evaluation metric: Precision-Recall AUC
#
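The competition metric, Precision-Recall AUC, is what scikit-learn computes as average precision; a toy illustration (the labels and scores below are made up, not competition data):

```python
from sklearn.metrics import average_precision_score

# ground-truth labels and predicted scores for four samples
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# area under the precision-recall curve, summarised as average precision
ap = average_precision_score(y_true, y_score)
print(ap)
```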
# ## Demo code
#
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
# %matplotlib inline

# read data
homePath = "data"
trainPath = os.path.join(homePath, "train.csv")
testPath = os.path.join(homePath, "test.csv")
submitPath = os.path.join(homePath, "sample_submit.csv")
trainData = pd.read_csv(trainPath)
testData = pd.read_csv(testPath)
submitData = pd.read_csv(submitPath)
# According to the data description, the CaseId column is just a meaningless record number, so we drop it.
#
# drop(): axis selects the axis (0 for rows, 1 for columns); inplace modifies the data in place.
#
# drop the meaningless column
trainData.drop("CaseId", axis=1, inplace=True)
testData.drop("CaseId", axis=1, inplace=True)

# # A quick look at the data
#
# head(): shows the first 5 rows by default; pass a count, e.g. .head(15), to show more rows.
#
trainData.head(15)

# info() prints a concise summary: how many non-null values each column has, and each column's dtype.
#
#
trainData.info()

# hist(): plots a histogram per column; the figsize parameter sets the output figure size.
#
trainData.hist(figsize=(20, 20))

# To see how the features are related, compute the correlation matrix, then sort it by one feature.
#
#
corr_matrix = trainData.corr()
corr_matrix["Evaluation"].sort_values(ascending=False)  # ascending=False sorts in descending order

# separate the labels from the training set
y = trainData['Evaluation']
trainData.drop("Evaluation", axis=1, inplace=True)

# Train a K-Means model.
#
# KMeans():
# * `n_clusters` is the number of clusters to predict;
# * `init` is the centroid-initialisation method, `k-means++` by default rather than the classic random-sampling initialisation of K-means; set it to `random` for random initialisation;
# * `n_jobs` is the number of CPU cores to use; -1 uses all of them.

# +
# do k-means
from sklearn.cluster import KMeans
est = KMeans(n_clusters=2, init="k-means++", n_jobs=-1)
est.fit(trainData, y)

y_train = est.predict(trainData)
y_pred = est.predict(testData)

# save the predicted results
submitData['Evaluation'] = y_pred
submitData.to_csv("submit_data.csv", index=False)
# +
# calculate accuracy
from sklearn.metrics import accuracy_score

acc_train = accuracy_score(y, y_train)
print("acc_train = %f" % (acc_train))
# -

# ## Random forest
#
# K-means may not give great results here. On the competition site, the organisers provide two benchmark models, of which the random forest performs best. The code is below; readers can test it themselves.
#
#

# +
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# read the data
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
submit = pd.read_csv("data/sample_submit.csv")

# drop the id column
train.drop('CaseId', axis=1, inplace=True)
test.drop('CaseId', axis=1, inplace=True)

# extract the training labels
y_train = train.pop('Evaluation')

# build the random forest model
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train, y_train)
y_pred = clf.predict_proba(test)[:, 1]

# write the predictions to my_RF_prediction.csv
submit['Evaluation'] = y_pred
submit.to_csv('my_RF_prediction.csv', index=False)

# +
# feature importances
print(clf.feature_importances_)

# train accuracy
from sklearn.metrics import accuracy_score
y_train_pred = clf.predict(train)
print(y_train_pred)
acc_train = accuracy_score(y_train, y_train_pred)
print("acc_train = %f" % (acc_train))
@@ -1,71 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Titanic\n",
"\n",
"## Competition Description\n",
"The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.\n",
"\n",
"One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.\n",
"\n",
"In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.\n",
"\n",
"## Practice Skills\n",
"* Binary classification\n",
"* Python & SKLearn\n",
"\n",
"## Data\n",
"The data has been split into two groups:\n",
"\n",
"* training set (train.csv)\n",
"* test set (test.csv)\n",
"\n",
"The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the `ground truth`) for each passenger. Your model will be based on `features` like passengers' gender and class. You can also use feature engineering to create new features.\n",
"\n",
"The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.\n",
"\n",
"We also include `gender_submission.csv`, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.\n",
"\n",
"### Data description\n",
"![data1](images/data1.png)\n",
"![data2](images/data2.png)\n",
"\n",
"\n",
"### Variable Notes\n",
"pclass: A proxy for socio-economic status (SES)\n",
"* 1st = Upper\n",
"* 2nd = Middle\n",
"* 3rd = Lower\n",
"\n",
"\n",
"## Links\n",
"* [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
},
"main_language": "python"
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -1,58 +0,0 @@
# ---
# jupyter:
#   jupytext_format_version: '1.2'
#   kernelspec:
#     display_name: Python 3
#     language: python
#     name: python3
#   language_info:
#     codemirror_mode:
#       name: ipython
#       version: 3
#     file_extension: .py
#     mimetype: text/x-python
#     name: python
#     nbconvert_exporter: python
#     pygments_lexer: ipython3
#     version: 3.5.2
# ---
# # Titanic
#
# ## Competition Description
# The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
#
# One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
#
# In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
#
# ## Practice Skills
# * Binary classification
# * Python & SKLearn
#
# ## Data
# The data has been split into two groups:
#
# * training set (train.csv)
# * test set (test.csv)
#
# The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the `ground truth`) for each passenger. Your model will be based on `features` like passengers' gender and class. You can also use feature engineering to create new features.
#
# The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
#
# We also include `gender_submission.csv`, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
#
# ### Data description
# ![data1](images/data1.png)
# ![data2](images/data2.png)
#
#
# ### Variable Notes
# pclass: A proxy for socio-economic status (SES)
# * 1st = Upper
# * 2nd = Middle
# * 3rd = Lower
#
#
# ## Links
# * [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic)
@@ -302,8 +302,7 @@ def forward(n, X):
# use random weights to predict
forward(nn, X)
-y_pred = np.zeros(nn.z2.shape[0])
-y_pred[np.where(nn.z2[:,0]<nn.z2[:,1])] = 1
+y_pred = np.argmax(nn.z2, axis=1)
# plot data
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap=plt.cm.Spectral)
@@ -323,8 +322,7 @@ def backpropagation(n, X, y):
# print loss, accuracy
L = np.sum((n.z2 - y)**2)
-y_pred = np.zeros(nn.z2.shape[0])
-y_pred[np.where(nn.z2[:,0]<nn.z2[:,1])] = 1
+y_pred = np.argmax(nn.z2, axis=1)
acc = accuracy_score(y_true, y_pred)
print("epoch [%4d] L = %f, acc = %f" % (i, L, acc))
@@ -345,8 +343,7 @@ backpropagation(nn, X, t)
# +
# plot data
-y_pred = np.zeros(nn.z2.shape[0])
-y_pred[np.where(nn.z2[:,0]<nn.z2[:,1])] = 1
+y_pred = np.argmax(nn.z2, axis=1)
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap=plt.cm.Spectral)
plt.title("ground truth")
@@ -361,14 +358,13 @@ plt.show()
# ## How do we wrap a multi-layer neural network in a class?

# +
% matplotlib inline
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

-# defin sigmod & its derivate function
+# define sigmod
def sigmod(X):
    return 1.0/(1+np.exp(-X))
@@ -409,6 +405,7 @@ class NN_Model:
            Z.append(z)
        self.Z = Z
        return Z[-1]

    # back-propagation
    def backpropagation(self, X, y, n_epoch=None, epsilon=None):
@@ -421,6 +418,8 @@ class NN_Model:
        for i in range(n_epoch):
            # forward to calculate each node's output
            self.forward(X)
+            self.evaluate()

            # calc weights update
            W = self.W
@@ -437,7 +436,7 @@ class NN_Model:
                if j == n_layer - 1:
                    d = z*(1-z)*(d0 - z)
                else:
-                    d = z*(1-z)*np.dot(d0, W[jj].T)
+                    d = z*(1-z)*np.dot(d0, W[j].T)
                d0 = d
                D.insert(0, d)
@@ -453,21 +452,20 @@ class NN_Model:
            B[jj] += epsilon * np.sum(D[jj], axis=0)

-    def evaulate(self):
+    def evaluate(self):
        z = self.Z[-1]

        # print loss, accuracy
        L = np.sum((z - self.Y)**2)
-        y_pred = np.argmax(z)
-        y_true = np.argmax(self.Y)
+        y_pred = np.argmax(z, axis=1)
+        y_true = np.argmax(self.Y, axis=1)
        acc = accuracy_score(y_true, y_pred)
        print("L = %f, acc = %f" % (L, acc))
# +
# generate sample data
np.random.seed(0)
X, y = datasets.make_moons(200, noise=0.20)
@@ -481,17 +479,94 @@ t[np.where(y==1), 1] = 1
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)
plt.show()

-nn = NN_Model([2, 4, 2])

# +
# use the NN model and training
+nn = NN_Model([2, 6, 2])
nn.init_weight()
-nn.backpropagation(X, t)
+nn.backpropagation(X, t, 2000)
+nn.evaluate()

# +
# predict results & plot results
y_res = nn.forward(X)
y_pred = np.argmax(y_res, axis=1)

# plot data
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)
plt.title("ground truth")
plt.show()

plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap=plt.cm.Spectral)
plt.title("predicted")
plt.show()
# -
# ## Further analysis

# +
# print some results
print(y_res[1:10, :])
# -

# **Questions**
# 1. We would like the probability of each class, not just a raw score.
# 2. How do we handle multi-class problems?
# 3. How can we make the network train faster?
# 4. How do we abstract the class so that it supports more layer types?
# ## Softmax & the cross-entropy loss
#
# Softmax is often used as the output layer of a classification network. The key step in back-propagation is differentiation, and working through softmax's derivative gives a deeper understanding of back-propagation as well as more to think about regarding how gradients propagate.
#
# ### The softmax function
#
# In a neural network, softmax generally serves as the output layer for classification tasks. Its outputs can be read as the probability of choosing each class: for a three-class task, softmax converts the raw outputs, according to their relative sizes, into three class probabilities that sum to 1.
#
# The softmax function has this form:
# ![softmax](images/softmax.png)
#
# * $S_i$ is the class-probability output of softmax
# * $z_k$ is the output of neuron $k$
#
# More concretely, as the figure below shows:
# ![softmax_demo](images/softmax_demo.png)
# In plain terms, softmax maps raw outputs such as 3, 1, -3 into values in (0, 1) whose sum is 1 (so they satisfy the properties of a probability distribution); at prediction time we then pick the node with the largest probability (i.e. the largest value) as the predicted class.
#
#
#
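As a quick numerical check of the mapping just described (3, 1, -3 becoming probabilities), a minimal softmax in NumPy:

```python
import numpy as np

def softmax(z):
    # subtract the max before exponentiating for numerical stability;
    # this shift does not change the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([3.0, 1.0, -3.0]))
print(p, p.sum())   # the largest input gets by far the largest probability
```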
# Start with the output of a single neuron, shown below:
# ![neuron](images/neuron.png)
#
# The neuron's output is defined as:
# ![zi](images/zi.png)
# where $W_{ij}$ is the $j$-th weight of the $i$-th neuron, $b$ is the bias, and $z_i$ is the $i$-th output of the network.
#
# Adding the softmax function on top of this output gives:
# ![ai](images/ai.png)
# where $a_i$ is the $i$-th output value of softmax.
#
#
# ### The loss function
#
# Back-propagation needs a loss function that measures the error between the ground truth and the network's estimate; only once the error is known can we tell how the weights should be adjusted.
#
# Many loss functions exist. Here we use cross-entropy, mainly because its derivative turns out to be simple and cheap to compute, and because cross-entropy avoids the slow learning that some other losses suffer from. The cross-entropy function is:
#
# ![cross_entropy](images/cross_entropy.png)
#
# where $y_i$ is the true classification result.
#
#
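Writing the formula above as code, $C = -\sum_i y_i \log a_i$ with one-hot labels reduces to the negative log of the probability assigned to the true class:

```python
import numpy as np

def cross_entropy(y_true, a):
    """y_true: one-hot labels; a: softmax outputs. C = -sum_i y_i * log(a_i)."""
    return -np.sum(y_true * np.log(a))

y = np.array([1.0, 0.0, 0.0])   # true class is the first one
a = np.array([0.7, 0.2, 0.1])   # softmax output
print(cross_entropy(y, a))      # equals -log(0.7)
```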
# ## References
# * Back-propagation
#     * [零基础入门深度学习(3) - 神经网络和反向传播算法](https://www.zybuluo.com/hanbingtao/note/476663)
#     * [Neural Network Using Python and Numpy](https://www.python-course.eu/neural_networks_with_python_numpy.php)
#     * http://www.cedar.buffalo.edu/%7Esrihari/CSE574/Chap5/Chap5.3-BackProp.pdf
#     * https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
# * Softmax & cross-entropy
#     * [交叉熵代价函数(作用及公式推导)](https://blog.csdn.net/u014313009/article/details/51043064)
#     * [手打例子一步一步带你看懂softmax函数以及相关求导过程](https://www.jianshu.com/p/ffa51250ba2e)
#     * [简单易懂的softmax交叉熵损失函数求导](https://www.jianshu.com/p/c02a1fbffad6)
@@ -2,10 +2,11 @@
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

-# define sigmod & its derivate function
+# define sigmod
def sigmod(X):
    return 1.0/(1+np.exp(-X))
@@ -58,6 +59,8 @@ class NN_Model:
        for i in range(n_epoch):
            # forward to calculate each node's output
            self.forward(X)
+            self.evaluate()

            # calc weights update
            W = self.W
@@ -74,7 +77,7 @@ class NN_Model:
                if j == n_layer - 1:
                    d = z*(1-z)*(d0 - z)
                else:
-                    d = z*(1-z)*np.dot(d0, W[jj].T)
+                    d = z*(1-z)*np.dot(d0, W[j].T)
                d0 = d
                D.insert(0, d)
@@ -90,14 +93,14 @@ class NN_Model:
            B[jj] += epsilon * np.sum(D[jj], axis=0)

-    def evaulate(self):
+    def evaluate(self):
        z = self.Z[-1]

        # print loss, accuracy
        L = np.sum((z - self.Y)**2)
-        y_pred = np.argmax(z)
-        y_true = np.argmax(self.Y)
+        y_pred = np.argmax(z, axis=1)
+        y_true = np.argmax(self.Y, axis=1)
        acc = accuracy_score(y_true, y_pred)
        print("L = %f, acc = %f" % (L, acc))
@@ -114,12 +117,11 @@ t[np.where(y==0), 0] = 1
t[np.where(y==1), 1] = 1

# plot data
-plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)
-plt.show()
+#plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)
+#plt.show()

-nn = NN_Model([2, 4, 2])
+nn = NN_Model([2, 3, 2])
nn.init_weight()
nn.backpropagation(X, t)
nn.evaluate()
@@ -1,3 +1,4 @@
+# -*- coding: utf-8 -*-
# ---
# jupyter:
#   jupytext_format_version: '1.2'
@@ -74,3 +75,73 @@ for i in range(64):
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))
# -
# ## Iris
#
# This data set consists of 3 different types of irises' (Setosa, Versicolour, and Virginica) petal and sepal lengths, stored in a 150x4 numpy.ndarray.
#
# The rows are the samples and the columns are: Sepal Length, Sepal Width, Petal Length and Petal Width.
#
# +
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn.decomposition import PCA

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :]
y = iris.target

# plot the samples
plt.figure(figsize=(15, 5))
plt.subplots_adjust(bottom=.05, top=.9, left=.05, right=.95)

plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1,
            edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.subplot(122)
plt.scatter(X[:, 2], X[:, 3], c=y, cmap=plt.cm.Set1,
            edgecolor='k')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')

plt.show()
# +
from sklearn.manifold import Isomap

iso = Isomap(n_neighbors=5, n_components=2)
proj = iso.fit_transform(X)

plt.figure(figsize=(15, 9))
plt.scatter(proj[:, 0], proj[:, 1], c=y)
plt.colorbar()
plt.show()
# -
# ## Blobs
#

# +
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate 3 blobs with 2 classes where the second blob contains
# half positive samples and half negative samples. Probability in this
# blob is therefore 0.5.
centers = [(-5, -5), (0, 0), (5, 5)]
n_samples = 500

X, y = make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0,
                  centers=centers, shuffle=False, random_state=42)

plt.figure(figsize=(15, 9))
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.colorbar()
plt.show()