[AI] Feature Engineering for Machine Learning: Feature Preprocessing (Part 1)
I recently revisited the commonly used feature-preprocessing methods, drawing mainly on the official scikit-learn documentation. These are key points recorded for my own reference; they are somewhat rough and scattered, and I will tidy them up when I find time. This is Part 1 and covers linear transformations, non-linear transformations, and per-sample normalization; each part also discusses whether the technique should be applied at all.

One sentence up front: There are no rules of thumb that apply to all applications.

Contents
- Scaling features to a range
- Should I standardize the input variables (column vectors / features)?
- Should I standardize the target variables (column vectors / targets)?
- Should I standardize the variables (column vectors) for unsupervised learning?
- 2. Non-linear transformation
- Should I nonlinearly transform the data?
- Should I standardize the input cases (row vectors / samples)?
- Appendix: Compare the effect of different scalers on data with outliers

1. Linear transformation

Standardization

Overview: Standardization, or mean removal and variance scaling, transforms the data to center it by removing the mean value of each feature, then scales it by dividing non-constant features by their standard deviation.

Rationale: many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

Notes:
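As a small illustration (not part of the original notes): scikit-learn's StandardScaler implements exactly this mean-removal and variance-scaling; the toy array below is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: 3 samples x 2 features (illustrative values only).
X = np.array([[1.0, -1.0],
              [2.0,  0.0],
              [0.0,  1.0]])

scaler = StandardScaler()           # removes each feature's mean, divides by its std
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))        # ~0 per feature
print(X_scaled.std(axis=0))         # ~1 per feature
```

In a real pipeline the scaler would be fit on the training split only and then applied to the test split with transform, to avoid leakage.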
Scaling features to a range

Overview: An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using MinMaxScaler or MaxAbsScaler, respectively.
Motivation:
Notes:
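A minimal sketch of the two range scalers named above (toy values invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

X = np.array([[1.0, -5.0],
              [2.0,  0.0],
              [4.0, 10.0]])

# Rescale each feature to [0, 1] (the default feature_range).
X_minmax = MinMaxScaler().fit_transform(X)

# Divide each feature by its maximum absolute value, mapping it into [-1, 1]
# while keeping zero entries at zero (useful for sparse data).
X_maxabs = MaxAbsScaler().fit_transform(X)
```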
Appendix: Scaling vs. whitening

A side note on the difference between whitening and scaling.

Overview: A whitening transformation or sphering transformation is a linear transformation that transforms a vector of random variables with a known covariance matrix into a set of new variables whose covariance is the identity matrix, meaning that they are uncorrelated and each have variance 1. The transformation is called "whitening" because it changes the input vector into a white noise vector. Several other transformations are closely related to whitening: a decorrelation transform removes only the correlations but leaves variances intact, a standardization transform sets variances to 1 but leaves correlations intact, and a coloring transformation maps a vector of white random variables to a random vector with a specified covariance matrix.
See the Wikipedia article for details: https://en.wikipedia.org/wiki/Whitening_transformation

Motivation: It is sometimes not enough to center and scale the features independently, since a downstream model can further make some assumption on the linear independence of the features. To address this issue you can use PCA with whiten=True to further remove the linear correlation across features. In other words, because the feature-wise methods above treat each feature independently and ignore correlations between features, problems arise when a downstream model assumes the features are uncorrelated.
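A minimal sketch of that whitening option (synthetic correlated data invented for illustration; the only scikit-learn-specific piece is the whiten=True flag mentioned above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two strongly correlated toy features.
x = rng.normal(size=200)
X = np.column_stack([x, 2.0 * x + rng.normal(scale=0.1, size=200)])

# whiten=True rescales the principal components to unit variance,
# so the transformed features are uncorrelated with variance ~1.
X_white = PCA(n_components=2, whiten=True).fit_transform(X)

print(np.cov(X_white, rowvar=False).round(2))  # approximately the identity matrix
```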
Should I standardize the input variables (column vectors / features)?

That depends primarily on how the network combines input variables to compute the net input to the next (hidden or output) layer. If the input variables are combined via a distance function (such as Euclidean distance) in an RBF network, standardizing inputs can be crucial. The contribution of an input will depend heavily on its variability relative to other inputs. If one input has a range of 0 to 1, while another input has a range of 0 to 1,000,000, then the contribution of the first input to the distance will be swamped by the second input. So it is essential to rescale the inputs so that their variability reflects their importance, or at least is not in inverse relation to their importance.

If the input variables are combined linearly, as in an MLP, then it is rarely strictly necessary to standardize the inputs, at least in theory. The reason is that any rescaling of an input vector can be effectively undone by changing the corresponding weights and biases, leaving you with the exact same outputs as you had before. However, there are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima. Also, weight decay and Bayesian estimation can be done more conveniently with standardized inputs.

The main emphasis in the NN literature on initial values has been on the avoidance of saturation, hence the desire to use small random values. How small these random values should be depends on the scale of the inputs as well as the number of inputs and their correlations. Standardizing inputs removes the problem of scale dependence of the initial weights. But standardizing input variables can have far more important effects on initialization of the weights than simply avoiding saturation: if the inputs are not centered, the randomly initialized hyperplanes can easily miss the region where the data lies, and with such a poor initialization, local minima are very likely to occur. It is therefore important to center the inputs to get good random initializations. In particular, scaling the inputs to [-1, 1] will work better than [0, 1], although any scaling that sets to zero the mean or median or other measure of central tendency is likely to be as good, and robust estimators of location and scale (Iglewicz, 1983) will be even better for input variables with extreme outliers.

Standardizing input variables has different effects on different training algorithms for MLPs. For example:
- Steepest descent is very sensitive to scaling. The more ill-conditioned the Hessian is, the slower the convergence. Hence, scaling is an important consideration for gradient descent methods such as standard backprop.
- Quasi-Newton and conjugate gradient methods begin with a steepest descent step and therefore are scale sensitive. However, they accumulate second-order information as training proceeds and hence are less scale sensitive than pure gradient descent.
- Newton-Raphson and Gauss-Newton, if implemented correctly, are theoretically invariant under scale changes as long as none of the scaling is so extreme as to produce underflow or overflow.
- Levenberg-Marquardt is scale invariant as long as no ridging is required. There are several different ways to implement ridging; some are scale invariant and some are not. Performance under bad scaling will depend on details of the implementation.

Two of the most useful ways to standardize inputs are:
- rescale each input to have mean 0 and standard deviation 1, or
- rescale each input to have midrange 0 and range 2, i.e. minimum -1 and maximum 1.
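A hedged sketch of those two conventions using scikit-learn (these scaler choices are my illustration, not part of the quoted FAQ; toy values only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[ 0.0, 100.0],
              [ 5.0, 250.0],
              [10.0, 400.0]])

# Way 1: mean 0 and standard deviation 1 per input variable.
X_std = StandardScaler().fit_transform(X)

# Way 2: midrange 0 and range 2, i.e. each input rescaled to [-1, 1].
X_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
```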
Should I standardize the target variables (column vectors / targets)?

Standardizing target variables is typically more a convenience for getting good initial weights than a necessity. However, if you have two or more target variables and your error function is scale-sensitive like the usual least (mean) squares error function, then the variability of each target relative to the others can affect how well the net learns that target. If the targets are of equal importance, they should typically be standardized to the same range or the same standard deviation (for example in multi-task learning, or when the loss is a sum of several components); a sketch follows at the end of this subsection.

The scaling of the targets does not affect their importance in training if you use maximum likelihood estimation and estimate a separate scale parameter (such as a standard deviation) for each target variable. In this case, the importance of each target is inversely related to its estimated scale parameter. In other words, noisier targets will be given less importance. For weight decay and Bayesian estimation, the scaling of the targets affects the decay values and prior distributions. Hence it is usually most convenient to work with standardized targets.

Should I standardize the variables (column vectors) for unsupervised learning?

The most commonly used methods of unsupervised learning, including various kinds of vector quantization, Kohonen networks, Hebbian learning, etc., depend on Euclidean distances or scalar-product similarity measures. The considerations are therefore the same as for standardizing inputs in RBF networks. If you are using unsupervised competitive learning to try to discover natural clusters in the data, rather than for data compression, simply standardizing the variables may be inadequate. Better yet for finding natural clusters, try mixture models or nonparametric density estimation.

Appendix: saturation

Saturation means the outputs (activation values) are mostly 0 and 1 for a sigmoid, with not much in between.

Effects:
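Returning to the target-standardization point above, here is a hedged sketch of one convenient way to do it in scikit-learn (my suggestion, not from the original notes): TransformedTargetRegressor standardizes y before fitting and inverts the transform at prediction time, so predictions come back on the original scale.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1000.0 * X[:, 0] + rng.normal(size=100)   # toy target on a large scale

model = TransformedTargetRegressor(
    regressor=Ridge(),
    transformer=StandardScaler(),   # y is standardized for fitting, un-standardized for predict
)
model.fit(X, y)
print(model.predict(X[:3]))         # predictions are on the original target scale
```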
2. Non-linear transformation

Two types of transformations are available: quantile transforms and power transforms. Both quantile and power transforms are based on monotonic transformations of the features and thus preserve the rank of the values along each feature.

Quantile transforms

QuantileTransformer applies a non-linear transformation such that the probability density function of each feature will be mapped to a uniform or Gaussian distribution. It is a non-parametric transformation. Quantile transforms put all features into the same desired distribution based on the formula G^{-1}(F(X)), where F is the cumulative distribution function of the feature and G^{-1} the quantile function of the desired output distribution G. This formula uses the two following facts: (i) if X is a random variable with a continuous cumulative distribution function F, then F(X) is uniformly distributed on [0, 1]; (ii) if U is a random variable with uniform distribution on [0, 1], then G^{-1}(U) has distribution G. By performing a rank transformation, a quantile transform smooths out unusual distributions and is less influenced by outliers than scaling methods. It does, however, distort correlations and distances within and across features.

RobustScaler and QuantileTransformer are robust to outliers in the sense that adding or removing outliers in the training set will yield approximately the same transformation. But contrary to RobustScaler, QuantileTransformer will also automatically collapse any outlier by setting them to the a priori defined range boundaries (0 and 1). This can result in saturation artifacts for extreme values. (I did not fully understand this passage.) In short, QuantileTransformer essentially only considers the ranks of the sorted feature values, distorting the distances and relationships between the values, which feels rather heavy-handed to me.

Appendix: Quantile function

The quantile function, associated with a probability distribution of a random variable, specifies the value of the random variable such that the probability of the variable being less than or equal to that value equals the given probability. It is also called the percentile function, percent-point function or inverse cumulative distribution function. See https://en.wikipedia.org/wiki/Quantile_function for details.

Power transforms

Power transforms are a family of parametric, monotonic transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible in order to stabilize variance and minimize skewness. In many modeling scenarios, normality of the features in a dataset is desirable. PowerTransformer currently provides two such power transformations, the Yeo-Johnson transform and the Box-Cox transform. Box-Cox can only be applied to strictly positive data. In both methods, the transformation is parameterized by lambda, which is determined through maximum likelihood estimation. A brief usage sketch of both transformers follows.
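A hedged sketch of the two transformers discussed above (synthetic right-skewed data and standard scikit-learn options; the specific parameter values are just for illustration):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer, PowerTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))   # strongly right-skewed toy feature

# Non-parametric: map the feature to an approximately normal distribution via its quantiles.
X_q = QuantileTransformer(output_distribution="normal", random_state=0).fit_transform(X)

# Parametric: Yeo-Johnson works for any real values; Box-Cox needs strictly positive data.
X_yj = PowerTransformer(method="yeo-johnson").fit_transform(X)
X_bc = PowerTransformer(method="box-cox").fit_transform(X)
```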
Should I nonlinearly transform the data?

Most importantly, nonlinear transformations of the targets are important with noisy data, via their effect on the error function. Many commonly used error functions are functions solely of the difference abs(target - output). Nonlinear transformations (unlike linear transformations) change the relative sizes of these differences. With most error functions, the net will expend more effort, so to speak, trying to learn target values for which abs(target - output) is large.

Less importantly, smooth functions are usually easier to learn than rough functions. Generalization is also usually better for smooth functions. So nonlinear transformations (of either inputs or targets) that make the input-output function smoother are usually beneficial. For classification problems, you want the class boundaries to be smooth. When there are only a few inputs, it is often possible to transform the data to a linear relationship, in which case you can use a linear model instead of a more complex neural net, and many things (such as estimating generalization error and error bars) will become much simpler. A variety of NN architectures (RBF networks, B-spline networks, etc.) amount to using many nonlinear transformations, possibly involving multiple variables simultaneously, to try to make the input-output function approximately linear (Ripley 1996, chapter 4). There are particular applications, such as signal and image processing, in which very elaborate transformations are useful (Masters 1994).

It is usually advisable to choose an error function appropriate for the distribution of noise in your target variables (McCullagh and Nelder 1989). But if your software does not provide a sufficient variety of error functions, then you may need to transform the target so that the noise distribution conforms to whatever error function you are using. For example, if you have to use least-(mean-)squares training, you will get the best results if the noise distribution is approximately Gaussian with constant variance, since least-(mean-)squares is maximum likelihood in that case. Heavy-tailed distributions (those in which extreme values occur more often than in a Gaussian distribution, often as indicated by high kurtosis) are especially of concern, due to the loss of statistical efficiency of least-(mean-)square estimates (Huber 1981). Note that what is important is the distribution of the noise, not the distribution of the target values.

The distribution of inputs may suggest transformations, but this is by far the least important consideration among those listed here. If an input is strongly skewed, a logarithmic, square root, or other power (between -1 and 1) transformation may be worth trying. If an input has high kurtosis but low skewness, an arctan transform can reduce the influence of extreme values:

arctan(c * (X - mean(X)) / std(X))
where c is a constant that controls how far the extreme values are brought in towards the mean. Arctan usually works better than tanh, which squashes the extreme values too much. Using robust estimates of location and scale (Iglewicz 1983) instead of the mean and standard deviation will work even better for pathological distributions.

3. Normalization (per-sample normalization)

Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples. This assumption is the base of the Vector Space Model often used in text classification and clustering contexts. That is, normalization here rescales each sample (its norm), not each feature dimension.

Motivation:
Note:
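A minimal sketch of per-sample normalization in scikit-learn (my illustration; Normalizer and the normalize helper both rescale each row to unit norm, toy rows below):

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

X = np.array([[3.0, 4.0, 0.0],
              [1.0, 1.0, 1.0]])

# Each row (sample) is divided by its own L2 norm, so dot products become cosine similarities.
X_unit = Normalizer(norm="l2").fit_transform(X)
# Equivalent functional form:
X_unit2 = normalize(X, norm="l2", axis=1)

print(np.linalg.norm(X_unit, axis=1))   # [1. 1.]
```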
Should I standardize the input cases (row vectors / samples)?

Whereas standardizing variables is usually beneficial, the effect of standardizing cases (row vectors) depends on the particular data. Cases are typically standardized only across the input variables, since including the target variable(s) in the standardization would make prediction impossible. There are some kinds of networks, such as simple Kohonen nets, where it is necessary to standardize the input cases to a common Euclidean length; this is a side effect of the use of the inner product as a similarity measure. If the network is modified to operate on Euclidean distances instead of inner products, it is no longer necessary to standardize the input cases.

Standardization of cases should be approached with caution because it discards information. If that information is irrelevant, then standardizing cases can be quite helpful. If that information is important, then standardizing cases can be disastrous. Issues regarding the standardization of cases must be carefully evaluated in every application. There are no rules of thumb that apply to all applications. You may want to standardize each case if there is extraneous variability between cases. Consider the common situation in which each input variable represents a pixel in an image. If the images vary in exposure, and exposure is irrelevant to the target values, then it would usually help to subtract the mean of each case to equate the exposures of different cases. If the images vary in contrast, and contrast is irrelevant to the target values, then it would usually help to divide each case by its standard deviation to equate the contrasts of different cases. Given sufficient data, a NN could learn to ignore exposure and contrast. However, training will be easier and generalization better if you can remove the extraneous exposure and contrast information before training the network.

Appendix: Compare the effect of different scalers on data with outliers

You can look at the official example comparing the different methods: "Compare the effect of different scalers on data with outliers" in the scikit-learn 1.0.1 documentation.

Finally, because the important point bears repeating three times:
There are no rules of thumb that apply to all applications.
There are no rules of thumb that apply to all applications.
There are no rules of thumb that apply to all applications.

Main references: