# KeepNotes blog

Stay hungry, Stay Foolish.


```python
from sklearn import preprocessing

X_scaled = preprocessing.scale(X_train)
```

```python
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_new = scaler.transform(X_train)
X_test_new = scaler.transform(X_test)
```

```python
scaler = preprocessing.StandardScaler()
X_train_new = scaler.fit_transform(X_train)
```

```python
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_test_minmax = min_max_scaler.transform(X_test)
```

When scaling sparse data:

• Prefer `MaxAbsScaler` or the `maxabs_scale` function, which are designed to preserve sparsity
• To use `StandardScaler` / `scale`, you must pass `with_mean=False` (centering would densify the matrix)
• `RobustScaler` cannot be `fit` on a sparse matrix, but its `transform` method accepts one
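A minimal sketch of the sparse-data options above, assuming a SciPy CSR matrix as input (the example data is made up for illustration):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# A small sparse matrix in CSR format
X_sparse = sparse.csr_matrix(np.array([[1., 0., 4.],
                                       [0., 2., 0.],
                                       [3., 0., 8.]]))

# MaxAbsScaler divides each column by its maximum absolute value,
# so zero entries stay zero and sparsity is preserved
X_maxabs = MaxAbsScaler().fit_transform(X_sparse)

# StandardScaler accepts sparse input only when centering is disabled
X_std = StandardScaler(with_mean=False).fit_transform(X_sparse)

print(X_maxabs.toarray())
```

Both results come back as sparse matrices; `MaxAbsScaler` maps each column into `[-1, 1]` without shifting it.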

Scaling vs Whitening: scaling treats each feature independently, so it cannot remove linear correlations between features; whitening additionally transforms the data so that the resulting features are decorrelated with unit variance.
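To make the distinction concrete, here is a small sketch on synthetic data (the data and the choice of `PCA(whiten=True)` as the whitening method are illustrative assumptions, not part of the original post):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Two strongly correlated features
x = rng.normal(size=200)
X = np.column_stack([x, x + 0.1 * rng.normal(size=200)])

# Scaling: unit variance per feature, but the correlation remains
X_scaled = StandardScaler().fit_transform(X)
print(np.corrcoef(X_scaled.T)[0, 1])   # still close to 1

# Whitening: the projected components are decorrelated with unit variance
X_white = PCA(whiten=True).fit_transform(X)
print(np.corrcoef(X_white.T)[0, 1])    # close to 0
```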

#### Non-linear transformation

```python
from sklearn import preprocessing
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
X_train_trans = quantile_transformer.fit_transform(X_train)
X_test_trans = quantile_transformer.transform(X_test)
```

```python
# Box-Cox requires strictly positive input data
pt = preprocessing.PowerTransformer(method='box-cox', standardize=False)
pt.fit_transform(X_train)
```

#### Normalization

I once assumed Normalization was the same thing as Standardization, but it is actually its own kind of `Transformer`: it rescales each individual sample (row) to unit norm, rather than each feature. The `normalize` function supports L1- and L2-norm normalization.

```python
X_normalized = preprocessing.normalize(X, norm='l2')
```

```python
normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing
normalizer.transform(X)
normalizer.transform([[-1., 1., 0.]])
```

#### Encoding categorical features

```python
from sklearn.preprocessing import OneHotEncoder

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
enc = OneHotEncoder(handle_unknown='ignore').fit(X)
print(enc.get_feature_names_out())  # get_feature_names() in older scikit-learn versions
enc.transform(X).toarray()
# array([[0., 1., 0., 1., 0., 1.],
#        [1., 0., 1., 0., 1., 0.]])
```

```python
import pandas as pd

data = pd.DataFrame(X)
data_dummies = pd.get_dummies(data)
print(data_dummies.columns)
# extract the underlying NumPy array
data_dummies.values
# array([[0, 1, 0, 1, 0, 1],
#        [1, 0, 1, 0, 1, 0]], dtype=uint8)
```

#### Discretization

`KBinsDiscretizer` (K-bins discretization) has a few key parameters (knowing the defaults makes the examples easier to read):

• `n_bins`: number of bins, default 5; a list gives a per-feature bin count
• `encode`: encoding of the result, default `onehot`
  • `onehot`: returns a sparse matrix
  • `onehot-dense`: returns a dense array
  • `ordinal`: returns the bin index
• `strategy`: binning strategy, default `quantile`
  • `uniform`: equal-width bins
  • `quantile`: bin edges at quantiles, so each bin holds roughly the same number of points
  • `kmeans`: values in each bin share the nearest center of a 1D k-means clustering

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-3., 5., 15.],
              [ 0., 6., 14.],
              [ 6., 3., 11.]])
est = KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal')
est.fit_transform(X)
# array([[0., 1., 1.],
#        [1., 1., 1.],
#        [2., 0., 0.]])
```

```python
from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=5.5)
binarizer.transform(X)
# array([[0., 0., 1.],
#        [0., 1., 1.],
#        [1., 0., 1.]])
```

#### Generating polynomial features

`PolynomialFeatures` expands the features into polynomial terms up to a given degree (below, degree 2: for two features `[a, b]`, the polynomial features are `[1, a, b, a^2, ab, b^2]`).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)
poly = PolynomialFeatures(2)
poly.fit_transform(X)
```

#### Custom transformers

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# apply log(1 + x) element-wise
transformer = FunctionTransformer(np.log1p, validate=True)
X = np.array([[0, 1], [2, 3]])
transformer.transform(X)
```

References:

• Preprocessing data (scikit-learn user guide)
• 《Python机器学习基础教程》 (*Introduction to Machine Learning with Python*)