# Scikit-learn Preprocessing data

#### Standardization, or mean removal and variance scaling

```python
from sklearn import preprocessing

# one-off scaling of an array
X_scaled = preprocessing.scale(X_train)

# fit on the training set, then apply the same transform to train and test
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_new = scaler.transform(X_train)
X_test_new = scaler.transform(X_test)

# fit_transform combines the two steps on the training set
scaler = preprocessing.StandardScaler()
X_train_new = scaler.fit_transform(X_train)

# scale features to a given range (default [0, 1])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_test_minmax = min_max_scaler.transform(X_test)
```

For sparse data:

• Prefer MaxAbsScaler or the maxabs_scale function.
• To use StandardScaler or the scale function, pass with_mean=False (centering would destroy sparsity).
• RobustScaler cannot be fitted on sparse matrices, but its transform method accepts them.
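A minimal sketch of the sparse-data rules above (the toy matrix is my own):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# a small sparse matrix (CSR); values are arbitrary
X_sparse = sparse.csr_matrix(np.array([[1., -2., 0.],
                                       [0., 4., 3.]]))

# MaxAbsScaler divides each column by its maximum absolute value,
# so zeros stay zero and sparsity is preserved
X_maxabs = MaxAbsScaler().fit_transform(X_sparse)

# StandardScaler accepts sparse input only with with_mean=False,
# because subtracting the mean would turn every zero into a nonzero entry
X_std = StandardScaler(with_mean=False).fit_transform(X_sparse)
```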

Scaling vs Whitening: scaling treats each feature independently and cannot remove linear correlations between features; whitening (e.g. PCA(whiten=True)) additionally decorrelates them.
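To make the distinction concrete, a small sketch (the synthetic data is my own): standardizing leaves two correlated features correlated, while PCA with whiten=True also removes the correlation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# two features driven by one latent factor, hence strongly correlated
latent = rng.randn(200, 1)
X = latent @ np.array([[1.0, 0.8]]) + 0.1 * rng.randn(200, 2)

X_scaled = StandardScaler().fit_transform(X)  # unit variance, still correlated
X_white = PCA(whiten=True).fit_transform(X)   # unit variance and decorrelated

corr_scaled = np.corrcoef(X_scaled.T)[0, 1]   # close to 1
corr_white = np.corrcoef(X_white.T)[0, 1]     # close to 0
```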

#### Non-linear transformation

```python
from sklearn import preprocessing
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# map each feature to a uniform distribution via its quantiles
quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
X_train_trans = quantile_transformer.fit_transform(X_train)
X_test_trans = quantile_transformer.transform(X_test)

# Box-Cox power transform (requires strictly positive data)
pt = preprocessing.PowerTransformer(method='box-cox', standardize=False)
pt.fit_transform(X_train)
```
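One caveat: method='box-cox' works above only because the iris features are strictly positive. The default 'yeo-johnson' method also handles zeros and negatives; a sketch with made-up data:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson accepts zero and negative inputs, unlike Box-Cox
pt = PowerTransformer(method='yeo-johnson')  # standardize=True by default
X = np.array([[-1.0], [0.0], [1.0], [5.0]])
X_trans = pt.fit_transform(X)  # output has zero mean and unit variance
```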

#### Normalization

I once assumed Normalization was the same thing as Standardization, but it is actually its own kind of transformer: it rescales individual samples (rows) to unit norm, whereas standardization operates per feature (column). The normalize function provides L1/L2-norm normalization.

```python
from sklearn import preprocessing

X_normalized = preprocessing.normalize(X, norm='l2')

# the Normalizer class offers the same operation as a reusable transformer
normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing: normalization is stateless
normalizer.transform(X)
normalizer.transform([[-1., 1., 0.]])
```
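The effect of L2 normalization is that every row ends up with unit Euclidean norm; a quick sketch on a toy matrix of my own:

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[3., 4.],
              [1., 0.]])
X_l2 = normalize(X, norm='l2')  # row [3, 4] has norm 5 -> [0.6, 0.8]
row_norms = np.linalg.norm(X_l2, axis=1)  # all 1.0
```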

#### Encoding categorical features

```python
from sklearn.preprocessing import OneHotEncoder

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
enc = OneHotEncoder(handle_unknown='ignore').fit(X)
print(enc.get_feature_names_out())  # get_feature_names was removed in scikit-learn 1.2
enc.transform(X).toarray()
# array([[0., 1., 0., 1., 0., 1.],
#        [1., 0., 1., 0., 1., 0.]])
```

```python
import pandas as pd

data = pd.DataFrame(X)
data_dummies = pd.get_dummies(data)
print(data_dummies.columns)
# extract the underlying NumPy array
data_dummies.values
# array([[0, 1, 0, 1, 0, 1],
#        [1, 0, 1, 0, 1, 0]], dtype=uint8)
```
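handle_unknown='ignore' matters at transform time: a category never seen during fit is encoded as all zeros instead of raising an error. A sketch (the unseen 'from Asia' value is my own):

```python
from sklearn.preprocessing import OneHotEncoder

X = [['male', 'from US'], ['female', 'from Europe']]
enc = OneHotEncoder(handle_unknown='ignore').fit(X)

# 'from Asia' was never seen during fit -> its block is all zeros
row = enc.transform([['female', 'from Asia']]).toarray()
# columns: [female, male, from Europe, from US]
```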

#### Discretization

KBinsDiscretizer (K-bins discretization) has a few key parameters (knowing the defaults makes the examples easier to follow):

• n_bins: number of bins, default 5; a list gives a per-feature bin count
• encode: encoding of the result, default 'onehot'
  • 'onehot': returns a sparse matrix
  • 'onehot-dense': returns a dense array
  • 'ordinal': returns the bin index
• strategy: binning strategy, default 'quantile'
  • 'uniform': equal-width bins
  • 'quantile': quantile-based edges, so each bin holds the same number of data points
  • 'kmeans': values in each bin share the nearest center of a 1D k-means clustering

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-3., 5., 15.],
              [ 0., 6., 14.],
              [ 6., 3., 11.]])
est = KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit(X)
est.fit_transform(X)
# array([[0., 1., 1.],
#        [1., 1., 1.],
#        [2., 0., 0.]])
```
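The strategy parameter changes where the bin edges fall. A minimal comparison on toy data with one outlier (my own construction):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.], [1.], [2.], [10.]])  # 10 is an outlier

# 'uniform': equal-width edges [0, 5, 10] -> the outlier dominates the split
uni = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='uniform')
uni_bins = uni.fit_transform(X).ravel()

# 'quantile': edge at the median (1.5) -> two samples per bin
qua = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='quantile')
qua_bins = qua.fit_transform(X).ravel()
```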

Binarizer thresholds features, mapping values above the threshold to 1 and the rest to 0:

```python
from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=5.5)
binarizer.transform(X)
# array([[0., 0., 1.],
#        [0., 1., 1.],
#        [1., 0., 1.]])
```

#### Generating polynomial features

PolynomialFeatures expands the features into a degree-N polynomial basis. Below is degree 2: with two features [a, b], the generated features are [1, a, b, a^2, ab, b^2].

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)
poly = PolynomialFeatures(2)
poly.fit_transform(X)
# array([[ 1.,  0.,  1.,  0.,  0.,  1.],
#        [ 1.,  2.,  3.,  4.,  6.,  9.],
#        [ 1.,  4.,  5., 16., 20., 25.]])
```
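When only interaction terms are wanted, interaction_only=True drops the pure powers a^2 and b^2, keeping [1, a, b, ab]; a sketch on the same X:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)
# keep the bias, the raw features, and the cross term a*b only
poly = PolynomialFeatures(degree=2, interaction_only=True)
X_inter = poly.fit_transform(X)
```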

#### Custom transformers

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# wrap an arbitrary function as a transformer; log1p(x) = log(1 + x)
transformer = FunctionTransformer(np.log1p, validate=True)
X = np.array([[0, 1], [2, 3]])
transformer.transform(X)
```
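Supplying inverse_func makes the transformer invertible, which pipeline utilities such as TransformedTargetRegressor rely on; a sketch pairing log1p with its inverse expm1:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# pair log1p with expm1 so the transform can be undone
transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1, validate=True)
X = np.array([[0., 1.], [2., 3.]])
X_back = transformer.inverse_transform(transformer.transform(X))  # recovers X
```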

References:

• scikit-learn User Guide, "Preprocessing data"
• 《Python机器学习基础教程》 (Introduction to Machine Learning with Python)