
Machine Learning Feature Engineering -- Feature Preprocessing (Part 1)

I recently went back over the commonly used feature preprocessing methods, mainly based on the sklearn official documentation. These notes record the key points for my own reference; they are somewhat rough and mixed, and I will tidy them up when I find the time.

This is the first part, covering linear transformations, non-linear transformations, and per-sample normalization. Each part also discusses whether the corresponding practice should be adopted.

Let me put one sentence up front: There are no rules of thumb that apply to all applications.

Contents

Machine Learning Feature Engineering -- Feature Preprocessing (Part 1)
1. Linear transformations
Standardization
Feature scaling (Scaling features to a range)
Appendix: Scaling vs Whitening
Should I standardize the input variables (column vectors/features)?
Should I standardize the target variables (column vectors/targets)?
Should I standardize the variables (column vectors) for unsupervised learning?
Appendix: saturation
2. Non-linear transformations
quantile transforms
Appendix: Quantile function
power transforms
Should I nonlinearly transform the data?
3. Normalization (per-sample normalization)
Should I standardize the input cases (row vectors/samples)?
Appendix: Compare the effect of different scalers on data with outliers


1. Linear transformations

Standardization

Overview:

Standardization, or mean removal and variance scaling, transforms the data to center it by removing the mean value of each feature, then scales it by dividing non-constant features by their standard deviation.

Motivation:

Many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

Caveats:

  • They might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
  • Sensitive to outliers.
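As a minimal sketch of how this might look in code (assuming scikit-learn and NumPy are installed; the data array is made up for illustration):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Toy data: 3 samples x 2 features on very different scales (illustrative only)
    X = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

    scaler = StandardScaler()           # removes the per-feature mean and scales to unit variance
    X_std = scaler.fit_transform(X)     # fit on the training data, then transform it

    print(scaler.mean_, scaler.scale_)  # per-feature mean and standard deviation
    print(X_std.mean(axis=0))           # approximately 0 for each column
    print(X_std.std(axis=0))            # approximately 1 for each column

In practice the scaler is fit on the training set only and then applied unchanged to the validation/test sets.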

Feature scaling (Scaling features to a range)

Overview:

An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using MinMaxScaler or MaxAbsScaler, respectively.

  • MinMaxScaler: (X - X.min) / (X.max - X.min), scaling to the [0, 1] range.
  • MaxAbsScaler: scales in a way that the training data lies within the range [-1, 1] by dividing through the largest maximum value in each feature. It is meant for data that is already centered at zero or sparse data.

Motivation:

  • The motivation for this scaling includes robustness to very small standard deviations of features and preserving zero entries in sparse data.
  • It puts features on a common scale.

Caveats:

  • First, the original sparsity of the features: centering sparse data would destroy the sparseness structure in the data, and thus rarely is a sensible thing to do. However, it can make sense to scale sparse inputs, especially if features are on different scales. MaxAbsScaler was specifically designed for scaling sparse data, and is the recommended way to go about this.
  • Second, outliers: MinMaxScaler and MaxAbsScaler are both sensitive to outliers. If your data contains many outliers, scaling using the mean and variance of the data is likely to not work very well. In these cases, you can use RobustScaler as a drop-in replacement instead. It uses more robust estimates for the center and range of your data (see the small comparison sketch below).
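A small comparison sketch of these scalers on toy data with one outlier (assuming scikit-learn; the numbers are made up):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, RobustScaler

    # One feature with an outlier in the last row (illustrative only)
    X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

    print(MinMaxScaler().fit_transform(X).ravel())  # maps to [0, 1]; the outlier squashes the inliers near 0
    print(MaxAbsScaler().fit_transform(X).ravel())  # divides by the max absolute value; same squashing issue
    print(RobustScaler().fit_transform(X).ravel())  # centers on the median, scales by the IQR; inliers keep a usable spread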

Appendix: Scaling vs Whitening

As an aside, here is the difference between whitening and scaling.

Overview:

A whitening transformation or sphering transformation is a linear transformation that transforms a vector of random variables with a known covariance matrix into a set of new variables whose covariance is the identity matrix, meaning that they are uncorrelated and each have variance 1.[1] The transformation is called "whitening" because it changes the input vector into a white noise vector.

Several other transformations are closely related to whitening:

  1. the decorrelation transform removes only the correlations but leaves variances intact,
  2. the standardization transform sets variances to 1 but leaves correlations intact,
  3. a coloring transformation transforms a vector of white random variables into a random vector with a specified covariance matrix.[2]

For details, see the Wikipedia article: https://en.wikipedia.org/wiki/Whitening_transformation

Motivation:

It is sometimes not enough to center and scale the features independently, since a downstream model can further make some assumption on the linear independence of the features.

To address this issue you can use PCA with whiten=True to further remove the linear correlation across features.

In other words, since the feature processing methods above are all applied to each feature independently, they do not take correlations between features into account, and problems arise when a downstream model assumes the features are uncorrelated.
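A hedged sketch of whitening via PCA (assuming scikit-learn; the correlated toy data is generated only for illustration):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.RandomState(0)
    x = rng.normal(size=500)
    # Two strongly correlated features (illustrative only)
    X = np.column_stack([x, 2.0 * x + rng.normal(scale=0.1, size=500)])

    pca = PCA(whiten=True)            # decorrelate features and scale the components to unit variance
    X_white = pca.fit_transform(X)

    print(np.cov(X_white, rowvar=False).round(2))  # approximately the identity matrix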

Should I standardize the input variables (column vectors/features)?

That depends primarily on how the network combines input variables to compute the net input to the next (hidden or output) layer. If the input variables are combined via a distance function (such as Euclidean distance) in an RBF network, standardizing inputs can be crucial. The contribution of an input will depend heavily on its variability relative to other inputs. If one input has a range of 0 to 1, while another input has a range of 0 to 1,000,000, then the contribution of the first input to the distance will be swamped by the second input. So it is essential to rescale the inputs so that their variability reflects their importance, or at least is not in inverse relation to their importance.

If the input variables are combined linearly, as in an MLP, then it is rarely strictly necessary to standardize the inputs, at least in theory. The reason is that any rescaling of an input vector can be effectively undone by changing the corresponding weights and biases, leaving you with the exact same outputs as you had before. However, there are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima. Also, weight decay and Bayesian estimation can be done more conveniently with standardized inputs.

The main emphasis in the NN literature on initial values has been on the avoidance of saturation, hence the desire to use small random values. How small these random values should be depends on the scale of the inputs as well as the number of inputs and their correlations. Standardizing inputs removes the problem of scale dependence of the initial weights.

But standardizing input variables can have far more important effects on initialization of the weights than simply avoiding saturation. If the inputs are not centered, the randomly initialized hyperplanes can easily miss the data cloud entirely; with such a poor initialization, local minima are very likely to occur. It is therefore important to center the inputs to get good random initializations. In particular, scaling the inputs to [-1,1] will work better than [0,1], although any scaling that sets to zero the mean or median or other measure of central tendency is likely to be as good, and robust estimators of location and scale (Iglewicz, 1983) will be even better for input variables with extreme outliers.

Standardizing input variables has different effects on different training algorithms for MLPs. For example:

  • Steepest descent is very sensitive to scaling. The more ill-conditioned the Hessian is, the slower the convergence. Hence, scaling is an important consideration for gradient descent methods such as standard backprop.
  • Quasi-Newton and conjugate gradient methods begin with a steepest descent step and therefore are scale sensitive. However, they accumulate second-order information as training proceeds and hence are less scale sensitive than pure gradient descent.
  • Newton-Raphson and Gauss-Newton, if implemented correctly, are theoretically invariant under scale changes as long as none of the scaling is so extreme as to produce underflow or overflow.
  • Levenberg-Marquardt is scale invariant as long as no ridging is required. There are several different ways to implement ridging; some are scale invariant and some are not. Performance under bad scaling will depend on details of the implementation.

Two of the most useful ways to standardize inputs are:

  1. Mean 0 and standard deviation 1
  2. Midrange 0 and range 2 (i.e., minimum -1 and maximum 1); see the small sketch below.
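A minimal NumPy sketch of these two standardizations (the column vector below is made up):

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 10.0])        # one input variable (illustrative only)

    # 1. Mean 0 and standard deviation 1
    x_z = (x - x.mean()) / x.std()

    # 2. Midrange 0 and range 2 (minimum -1, maximum 1)
    midrange = (x.max() + x.min()) / 2.0
    half_range = (x.max() - x.min()) / 2.0
    x_mr = (x - midrange) / half_range

    print(x_z)
    print(x_mr)                                 # spans exactly [-1, 1]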

Should I standardize the target variables (column vectors/targets)?

Standardizing target variables is typically more a convenience for getting good initial weights than a necessity. However, if you have two or more target variables and your error function is scale-sensitive like the usual least (mean) squares error function, then the variability of each target relative to the others can affect how well the net learns that target. If the targets are of equal importance, they should typically be standardized to the same range or the same standard deviation. (For example, in multi-task learning, or when the loss is composed of multiple terms.)

The scaling of the targets does not affect their importance in training if you use maximum likelihood estimation and estimate a separate scale parameter (such as a standard deviation) for each target variable. In this case, the importance of each target is inversely related to its estimated scale parameter. In other words, noisier targets will be given less importance. For weight decay and Bayesian estimation, the scaling of the targets affects the decay values and prior distributions. Hence it is usually most convenient to work with standardized targets.

Should I standardize the variables (column vectors) for unsupervised learning?

The most commonly used methods of unsupervised learning, including various kinds of vector quantization, Kohonen networks, Hebbian learning, etc., depend on Euclidean distances or scalar-product similarity measures. The considerations are therefore the same as for standardizing inputs in RBF networks.

If you are using unsupervised competitive learning to try to discover natural clusters in the data, rather than for data compression, simply standardizing the variables may be inadequate.

Better yet, for finding natural clusters, try mixture models or nonparametric density estimation.

Appendix: saturation

Saturation means a unit is outputting (activation values) mostly 0 or 1 for a sigmoid, with not much in between.

Effects:

  • A saturated unit cannot differentiate much between patterns.
  • One reason why saturation is considered bad is that it stops gradient propagation: when a unit saturates, the gradients of all weights feeding into that unit become very small (since the sigmoid is almost flat) and thus the backpropagation learning signal vanishes.
  • See the Zhihu (知乎) discussion: 深度学习中saturation是什么意思? ("What does saturation mean in deep learning?")

2. Non-linear transformations

Two types of transformations are available: quantile transforms and power transforms. Both quantile and power transforms are based on monotonic transformations of the features and thus preserve the rank of the values along each feature.

quantile transforms

QuantileTransformer applies a non-linear transformation such that the probability density function of each feature will be mapped to a uniform or Gaussian distribution. It is a non-parametric transformation.

Quantile transforms put all features into the same desired distribution based on the formula G^{-1}(F(X)), where F is the cumulative distribution function of the feature and G^{-1} the quantile function of the desired output distribution G. This formula uses the two following facts: (i) if X is a random variable with a continuous cumulative distribution function F, then F(X) is uniformly distributed on [0, 1]; (ii) if U is a random variable with uniform distribution on [0, 1], then G^{-1}(U) has distribution G.

By performing a rank transformation, a quantile transform smooths out unusual distributions and is less influenced by outliers than scaling methods. It does, however, distort correlations and distances within and across features.

RobustScaler and QuantileTransformer are robust to outliers in the sense that adding or removing outliers in the training set will yield approximately the same transformation.

But contrary to RobustScaler, QuantileTransformer will also automatically collapse any outlier by setting them to the a priori defined range boundaries (0 and 1). This can result in saturation artifacts for extreme values. (I did not fully understand this at first; the point seems to be that any value beyond the training range is mapped exactly to the boundary 0 or 1, so extreme values become indistinguishable from one another.)

In short, QuantileTransformer essentially only looks at the ranks of the sorted feature values, and it distorts the distances and relationships between the values, which feels rather heavy-handed to me.
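A small sketch of QuantileTransformer (assuming scikit-learn; n_quantiles is lowered because the toy dataset is small, and the data itself is made up):

    import numpy as np
    from sklearn.preprocessing import QuantileTransformer

    rng = np.random.RandomState(0)
    X = rng.lognormal(size=(200, 1))           # skewed toy data (illustrative only)

    qt_uniform = QuantileTransformer(n_quantiles=100, output_distribution='uniform', random_state=0)
    qt_normal = QuantileTransformer(n_quantiles=100, output_distribution='normal', random_state=0)

    print(qt_uniform.fit_transform(X)[:5].ravel())   # values mapped into [0, 1]
    print(qt_normal.fit_transform(X)[:5].ravel())    # approximately standard normal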

Appendix: Quantile function

The quantile function, associated with a probability distribution of a random variable, specifies the value of the random variable such that the probability of the variable being less than or equal to that value equals the given probability. It is also called the percentile function, percent-point function or inverse cumulative distribution function.

For details, see https://en.wikipedia.org/wiki/Quantile_function
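A tiny sketch of the relationship between a CDF and its quantile function (inverse CDF), assuming SciPy is installed:

    from scipy.stats import norm

    p = 0.975
    x = norm.ppf(p)         # quantile function (inverse CDF): about 1.96 for the standard normal
    print(x, norm.cdf(x))   # applying the CDF recovers the probability 0.975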

power transforms

In many modeling scenarios, normality of the features in a dataset is desirable. Power transforms are a family of parametric, monotonic transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible in order to stabilize variance and minimize skewness.

PowerTransformer currently provides two such power transformations, the Yeo-Johnson transform and the Box-Cox transform.

Box-Cox can only be applied to strictly positive data. In both methods, the transformation is parameterized by λ (lambda), which is determined through maximum likelihood estimation.
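A hedged sketch of the two power transforms (assuming scikit-learn; the strictly positive, right-skewed toy data is illustrative):

    import numpy as np
    from sklearn.preprocessing import PowerTransformer

    rng = np.random.RandomState(0)
    X = rng.lognormal(size=(300, 1))               # strictly positive, right-skewed data

    pt_yj = PowerTransformer(method='yeo-johnson') # also works with zero and negative values
    pt_bc = PowerTransformer(method='box-cox')     # requires strictly positive data

    X_yj = pt_yj.fit_transform(X)
    X_bc = pt_bc.fit_transform(X)

    print(pt_yj.lambdas_, pt_bc.lambdas_)          # lambda fitted by maximum likelihood
    print(X_bc.mean(), X_bc.std())                 # ~0 mean, ~1 std (standardize=True by default)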

Should I nonlinearly transform the data?

Most importantly, nonlinear transformations of the targets are important with noisy data, via their effect on the error function. Many commonly used error functions are functions solely of the difference abs(target-output). Nonlinear transformations (unlike linear transformations) change the relative sizes of these differences. With most error functions, the net will expend more effort, so to speak, trying to learn target values for which abs(target-output) is large.

Less importantly, smooth functions are usually easier to learn than rough functions. Generalization is also usually better for smooth functions. So nonlinear transformations (of either inputs or targets) that make the input-output function smoother are usually beneficial. For classification problems, you want the class boundaries to be smooth. When there are only a few inputs, it is often possible to transform the data to a linear relationship, in which case you can use a linear model instead of a more complex neural net, and many things (such as estimating generalization error and error bars) will become much simpler. A variety of NN architectures (RBF networks, B-spline networks, etc.) amount to using many nonlinear transformations, possibly involving multiple variables simultaneously, to try to make the input-output function approximately linear (Ripley 1996, chapter 4). There are particular applications, such as signal and image processing, in which very elaborate transformations are useful (Masters 1994).

It is usually advisable to choose an error function appropriate for the distribution of noise in your target variables (McCullagh and Nelder 1989). But if your software does not provide a sufficient variety of error functions, then you may need to transform the target so that the noise distribution conforms to whatever error function you are using. For example, if you have to use least-(mean-)squares training, you will get the best results if the noise distribution is approximately Gaussian with constant variance, since least-(mean-)squares is maximum likelihood in that case. Heavy-tailed distributions (those in which extreme values occur more often than in a Gaussian distribution, often as indicated by high kurtosis) are especially of concern, due to the loss of statistical efficiency of least-(mean-)square estimates (Huber 1981). Note that what is important is the distribution of the noise, not the distribution of the target values.

The distribution of inputs may suggest transformations, but this is by far the least important consideration among those listed here. If an input is strongly skewed, a logarithmic, square root, or other power (between -1 and 1) transformation may be worth trying. If an input has high kurtosis but low skewness, an arctan transform can reduce the influence of extreme values:

arctan\left(c \cdot \frac{input - mean}{std}\right)

where c is a constant that controls how far the extreme values are brought in towards the mean. Arctan usually works better than tanh, which squashes the extreme values too much. Using robust estimates of location and scale (Iglewicz 1983) instead of the mean and standard deviation will work even better for pathological distributions.
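A minimal NumPy sketch of this arctan transform (c is a tuning constant; the data is made up):

    import numpy as np

    x = np.array([0.1, 0.2, 0.3, 0.4, 50.0])   # one heavy-tailed input (illustrative only)
    c = 1.0                                     # controls how strongly the extremes are pulled in

    x_arctan = np.arctan(c * (x - x.mean()) / x.std())
    print(x_arctan)                             # the extreme value is now bounded by pi/2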

3. Normalization (per-sample normalization)

Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

This assumption is the base of the Vector Space Model often used in text classification and clustering contexts.

That is, each sample is normalized (by its norm), rather than normalizing along the feature dimension.
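A minimal sketch of per-sample normalization (assuming scikit-learn; the rows below are made up):

    import numpy as np
    from sklearn.preprocessing import normalize

    X = np.array([[3.0, 4.0],
                  [1.0, 0.0],
                  [10.0, 0.0]])

    X_l2 = normalize(X, norm='l2')       # rescales each row (sample) to unit Euclidean norm
    print(X_l2)
    print(np.linalg.norm(X_l2, axis=1))  # all 1.0

The equivalent Normalizer class can be used inside a Pipeline; it is stateless, since each sample is treated independently.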

Motivation:

  • This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.
  • It can also help the training/optimization process, for example by preventing numerical overflow and reducing the chance of getting stuck in local optima.
  • You may want to standardize each case if there is extraneous variability between cases

Caveats:

  • If the network is modified to operate on Euclidean distances instead of inner products, it is no longer necessary to standardize the input cases.
  • Standardization of cases should be approached with caution because it discards information. If that information is irrelevant, then standardizing cases can be quite helpful. If that information is important, then standardizing cases can be disastrous. Issues regarding the standardization of cases must be carefully evaluated in every application.

Should I standardize the input cases (row vectors/samples)?

Whereas standardizing variables is usually beneficial, the effect of standardizing cases (row vectors) depends on the particular data. Cases are typically standardized only across the input variables, since including the target variable(s) in the standardization would make prediction impossible.

There are some kinds of networks, such as simple Kohonen nets, where it is necessary to standardize the input cases to a common Euclidean length; this is a side effect of the use of the inner product as a similarity measure. If the network is modified to operate on Euclidean distances instead of inner products, it is no longer necessary to standardize the input cases.

Standardization of cases should be approached with caution because it discards information. If that information is irrelevant, then standardizing cases can be quite helpful. If that information is important, then standardizing cases can be disastrous. Issues regarding the standardization of cases must be carefully evaluated in every application. There are no rules of thumb that apply to all applications.

You may want to standardize each case if there is extraneous variability between cases. Consider the common situation in which each input variable represents a pixel in an image. If the images vary in exposure, and exposure is irrelevant to the target values, then it would usually help to subtract the mean of each case to equate the exposures of different cases. If the images vary in contrast, and contrast is irrelevant to the target values, then it would usually help to divide each case by its standard deviation to equate the contrasts of different cases. Given sufficient data, a NN could learn to ignore exposure and contrast. However, training will be easier and generalization better if you can remove the extraneous exposure and contrast information before training the network.
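As an illustrative sketch of removing per-case exposure and contrast (flattened grayscale images as rows; the data is made up):

    import numpy as np

    rng = np.random.RandomState(0)
    images = rng.uniform(size=(4, 64))       # 4 flattened toy images (illustrative only)
    images[0] += 0.5                         # simulate a brighter exposure
    images[1] *= 2.0                         # simulate a higher contrast

    means = images.mean(axis=1, keepdims=True)
    stds = images.std(axis=1, keepdims=True)
    images_std = (images - means) / stds     # per-case mean 0 and standard deviation 1

    print(images_std.mean(axis=1).round(3))  # all ~0
    print(images_std.std(axis=1).round(3))   # all ~1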

Appendix: Compare the effect of different scalers on data with outliers

You can look at the official example comparing the different methods:

Compare the effect of different scalers on data with outliers — scikit-learn 1.0.1 documentation

Finally, the important sentence bears repeating three times:

There are no rules of thumb that apply to all applications.

There are no rules of thumb that apply to all applications.

There are no rules of thumb that apply to all applications.

Main references:
