Data Mining

Chapter Two

Data dispersion characteristics

Center

Mean: $\bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i$, $\mu = \frac{\sum x}{N}$

Median (for grouped data): $median = L_1 + \left(\frac{n/2 - (\sum freq)_l}{freq_{median}}\right) \cdot width$

Mode: $mean - mode = 3 \times (mean - median)$

Quartiles: $Q_1$ (25th percentile), $Q_3$ (75th percentile)

Variance:
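These measures are easy to verify numerically. A minimal sketch, assuming NumPy and an invented sample (neither is part of the original notes):

```python
import numpy as np

data = np.array([4, 8, 15, 16, 23, 42, 8, 15, 16, 8])  # hypothetical sample

mean = data.mean()                       # x̄ = (1/n) Σ x_i
median = np.median(data)                 # middle value of the sorted data
# Mode: the most frequent value (empirical relation: mean - mode ≈ 3 * (mean - median))
values, counts = np.unique(data, return_counts=True)
mode = values[np.argmax(counts)]
q1, q3 = np.percentile(data, [25, 75])   # quartiles Q1 and Q3
variance = data.var(ddof=0)              # (1/n) Σ (x_i - x̄)^2

print(mean, median, mode, q1, q3, variance)
```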
Pixel-Oriented Visualization Techniques

Similarity and Dissimilarity

Distance measure for symmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s + t}$

Here, "asymmetric" means the loss cost of the two states differs; in some data sets one state (e.g. the FP cases) is the absolute majority.

Jaccard coefficient (similarity measure for asymmetric binary variables): $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$

Minkowski distance (L-h norm): $d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$

Properties:
A distance that satisfies these properties is a metric.
$h = 1$: Manhattan distance $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$

Ordinal Variables: $z_{if} = \frac{r_{if} - 1}{M_f - 1}$

Dissimilarity for objects of mixed attribute types: $d(i, j) = \frac{\sum_{f = 1}^p \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f = 1}^p \delta_{ij}^{(f)}}$

Cosine similarity $cos(d_1, d_2) = \frac{d_1 \cdot d_2}{||d_1|| \cdot ||d_2||}$ is used to evaluate the similarity of sentences.
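The Minkowski family and cosine similarity can be sketched in a few lines of Python; the vectors here are invented for illustration:

```python
import numpy as np

x = np.array([1.0, 3.0, 5.0])
y = np.array([2.0, 1.0, 4.0])

def minkowski(a, b, h):
    """L-h norm distance: the h-th root of the sum of |a_k - b_k|^h."""
    return np.sum(np.abs(a - b) ** h) ** (1.0 / h)

manhattan = minkowski(x, y, 1)   # h = 1: Manhattan distance
euclidean = minkowski(x, y, 2)   # h = 2: Euclidean distance

# Cosine similarity: dot product divided by the product of the norms
cosine = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(manhattan, euclidean, cosine)
```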
Chapter Three

Data Processing

Data cleaning, data integration, data reduction, data transformation and data discretization.

$\chi^2$ (chi-square) test: $\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}$
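As a hedged illustration of the statistic, a tiny 2×2 contingency-table example with invented counts (the expected counts come from the row/column totals under independence):

```python
import numpy as np

# Hypothetical 2x2 contingency table of observed counts
observed = np.array([[250.0, 200.0],
                     [50.0, 1000.0]])

row_sums = observed.sum(axis=1, keepdims=True)
col_sums = observed.sum(axis=0, keepdims=True)
total = observed.sum()

expected = row_sums @ col_sums / total              # expected counts under independence
chi2 = ((observed - expected) ** 2 / expected).sum()

print(chi2)  # compare against a chi-square critical value with (r-1)(c-1) degrees of freedom
```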
Correlation coefficient (Pearson's product-moment coefficient): $r_{A, B} = \frac{\sum_{i = 1}^n (a_i - \bar{A})(b_i - \bar{B})}{(n - 1) \sigma_A \sigma_B} = \frac{\sum_{i = 1}^n (a_i b_i) - n \bar{A} \bar{B}}{(n - 1) \sigma_A \sigma_B}$

$r_{A, B} > 0$ means A and B are positively correlated.
Let $a_k' = (a_k - mean(A)) / std(A)$ and $b_k' = (b_k - mean(B)) / std(B)$.

Covariance
$Cov(A, B) = E((A - \bar{A})(B - \bar{B})) = \frac{\sum_{i = 1}^n (a_i - \bar{A})(b_i - \bar{B})}{n}$
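A quick numerical check of the correlation and covariance formulas above, on an invented pair of attributes (NumPy assumed):

```python
import numpy as np

a = np.array([6.0, 5.0, 4.0, 3.0, 2.0])    # attribute A (hypothetical values)
b = np.array([20.0, 10.0, 14.0, 5.0, 5.0])  # attribute B

n = len(a)
cov_ab = ((a - a.mean()) * (b - b.mean())).sum() / n   # Cov(A, B), population form
r_ab = ((a - a.mean()) * (b - b.mean())).sum() / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))

print(cov_ab, r_ab)   # r_ab > 0 -> A and B are positively correlated
# Cross-check against the library versions:
print(np.cov(a, b, bias=True)[0, 1], np.corrcoef(a, b)[0, 1])
```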
Data reduction

Unsupervised:
Supervised:
Semi-supervised:
Linear:
Nonlinear:

Dimensionality reduction (Feature reduction):
Selection: choose a best subset of size d from the available p features.

PCA

Given $\{x_1, ..., x_n\} \in \mathbb{R}^p$, the target is to find the $a$ that maximizes $var(z)$, where $z = ax$:

$$
\begin{aligned}
var(z) &= E((z - \bar{z})^2)\\
&= \frac{1}{n} \sum_{i = 1}^n (ax_i - a\bar{x})^2\\
&= \frac{1}{n} \sum_{i = 1}^n a^T(x_i - \bar{x})(x_i - \bar{x})^T a\\
&= a^T S a,\\
S &= \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})(x_i - \bar{x})^T
\end{aligned}
$$

which means
$\max_a a^T S a$, s.t. $a^T a = 1$.

$$
\begin{aligned}
L &= a^T S a - \lambda(a^T a - 1)\\
\frac{\partial L}{\partial a} &= 2Sa - 2\lambda a = 0
\end{aligned}
$$

So $\lambda$ and $a$ are an eigenvalue/eigenvector pair of $S$. Then $var(z) = a^T \lambda a = \lambda$, so the eigenvalues are chosen from largest to smallest.
Next, if we want another principal component:

$\max_{a_2} a_2^T S a_2$, s.t. $a_2^T a_2 = 1$, $cov(z^{(2)}, z^{(1)}) = 0$
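The derivation reduces PCA to an eigendecomposition of the covariance matrix $S$; a minimal NumPy sketch on random data (keeping 2 components is just an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # n = 100 samples, p = 5 features

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)  # S = (1/n) Σ (x_i - x̄)(x_i - x̄)^T

eigvals, eigvecs = np.linalg.eigh(S)      # eigh: S is symmetric
order = np.argsort(eigvals)[::-1]         # eigenvalues from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

A = eigvecs[:, :2]                        # first two principal directions a_1, a_2
Z = (X - x_bar) @ A                       # projected data

print(eigvals[:2])                        # var(z_k) equals the k-th eigenvalue
print(Z.var(axis=0, ddof=0))              # matches, under the same 1/n convention
```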
Main theoretical result:

LDA (Linear Discriminant Analysis)

Find a transformation $a$ such that $a^T X_1$ and $a^T X_2$ are maximally separated and each class is minimally dispersed (maximum separation):

$\max\ (a(\bar{x}_1 - \bar{x}_2))^2$, $\min\ var(z_1)$, $\min\ var(z_2)$

Suppose there exist two classes
$w_1, w_2$.

$\tilde{s}_i^2 = \sum_{y \in w_i} (y - \tilde{\mu}_i)^2 = \sum_{x \in w_i} (a^T x - a^T \mu_i)^2 = \sum_{x \in w_i} (a^T x - a^T \mu_i)(a^T x - a^T \mu_i)^T = \sum_{x \in w_i} a^T (x - \mu_i)(x - \mu_i)^T a = a^T S_i a$

Within-class scatter matrix: $S_W = S_1 + S_2$, so $\tilde{s}_1^2 + \tilde{s}_2^2 = a^T S_W a$

$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (a^T \mu_1 - a^T \mu_2)^2 = a^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T a = a^T S_B a$

Between-class scatter matrix: $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$
$J(a) = \frac{a^T S_B a}{a^T S_W a}$
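Maximizing $J(a)$ gives the well-known closed form $a \propto S_W^{-1}(\mu_1 - \mu_2)$; a short two-class sketch on synthetic Gaussian data (not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))   # class w1
X2 = rng.normal(loc=[3.0, 2.0], scale=1.0, size=(50, 2))   # class w2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - mu1).T @ (X1 - mu1)            # per-class scatter matrices
S2 = (X2 - mu2).T @ (X2 - mu2)
SW = S1 + S2                              # within-class scatter S_W

a = np.linalg.solve(SW, mu1 - mu2)        # direction maximizing J(a)
a /= np.linalg.norm(a)

z1, z2 = X1 @ a, X2 @ a                   # projected classes are well separated
print(z1.mean(), z2.mean(), z1.var(), z2.var())
```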
Chapter Four

FP mining

Itemset: a set of one or more items.

Find all the rules $X \rightarrow Y$ with minimum support and confidence.

Closed patterns and max-patterns: a max-pattern is also a closed pattern.

Apriori

An important property: **any subset of a frequent itemset must be frequent**.

Method:
Example:

Pseudo-code: (an illustrative sketch follows after this list)
Major computational challenges:
Improving Apriori: general ideas:
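The course's own pseudo-code is not reproduced in these notes; purely as an illustration of the level-wise generate-and-prune idea, a compact sketch on invented toy transactions:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise mining: any subset of a frequent itemset must be frequent."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}      # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # count support of the candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # candidate generation: join L_k with itself, then prune by the Apriori property
        prev = list(survivors)
        k += 1
        current = {
            a | b for a, b in combinations(prev, 2)
            if len(a | b) == k
            and all(frozenset(s) in survivors for s in combinations(a | b, k - 1))
        }
    return frequent

# Toy usage
print(apriori([{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}], min_support=2))
```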
FP-growth

Here we link a website; it [step 3, page 28] says: "Recursively mine conditional FP-trees and grow frequent patterns obtained so far. If the conditional FP-tree contains a single path, simply enumerate all the patterns."

Mining sequential patterns

Sequential patterns: GSP

Chapter Five

Decision Tree

It is derived from the perspective of probability: we can calculate the probability of every output for a given input. If we assume every condition is independent, then $P(X|C) = \prod P(X_i|C)$, so $\log P(X|C) = \sum \log P(X_i|C)$, which is why we let the cost function be $\log$. To understand this better, we can borrow a concept from thermodynamics called entropy.
$H(Y) = -\sum_{i = 1}^m p_i \log(p_i)$, where $p_i = P(Y = y_i)$
$Info(D) = -\sum_{i = 1}^m p_i \log_2(p_i)$
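A small sketch of $Info(D)$ and the information gain of a split, the quantity a decision tree uses to choose attributes (the class counts are invented for illustration):

```python
import math

def info(counts):
    """Info(D) = -Σ p_i log2(p_i) over the class distribution given by counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical data set: 9 positive / 5 negative examples
info_d = info([9, 5])

# Split on some attribute into partitions with class counts (pos, neg)
partitions = [(2, 3), (4, 0), (3, 2)]
n = sum(p + q for p, q in partitions)
info_a = sum((p + q) / n * info([p, q]) for p, q in partitions)

gain = info_d - info_a     # information gain of the split
print(info_d, info_a, gain)
```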
Bayes Classification Methods

First, we know that $P(B) = \sum_{i = 1}^M P(B|A_i) P(A_i)$, and
$P(H|X) = \frac{P(X|H) P(H)}{P(X)}$

Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise the predicted probability will be zero.
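The usual fix for that zero-probability problem is the Laplacian correction (add-one smoothing); a minimal categorical naïve Bayes sketch on an invented training set:

```python
from collections import Counter, defaultdict

# Hypothetical training data: (features, class)
train = [({"outlook": "sunny", "wind": "weak"}, "no"),
         ({"outlook": "sunny", "wind": "strong"}, "no"),
         ({"outlook": "rain", "wind": "weak"}, "yes"),
         ({"outlook": "overcast", "wind": "weak"}, "yes")]

classes = Counter(c for _, c in train)
counts = defaultdict(Counter)                       # (class, attribute) -> value counts
for x, c in train:
    for attr, val in x.items():
        counts[(c, attr)][val] += 1

def posterior(x, c, alpha=1):
    """P(C) * Π P(X_i|C) with add-alpha (Laplace) smoothing so no factor is zero."""
    p = classes[c] / sum(classes.values())
    for attr, val in x.items():
        vals = {v for cc in classes for v in counts[(cc, attr)]}
        p *= (counts[(c, attr)][val] + alpha) / (classes[c] + alpha * len(vals))
    return p

query = {"outlook": "sunny", "wind": "weak"}
print(max(classes, key=lambda c: posterior(query, c)))   # predicted class
```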
Support Vector Machines

Model Evaluation and Selection

Confusion Matrix:

Accuracy: $\frac{TP + TN}{ALL}$

Precision: $\frac{TP}{TP + FP}$

Holdout method

Cross-validation

Bootstrap

Estimating Confidence Intervals: t-test

ROC curves

Chapter Six

K-means

K-medoids: choose the closest point to the K-means center.
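To round off Chapter Six, a compact Lloyd-style K-means sketch on synthetic points; the final step snaps each center to its closest cluster member, approximating the K-medoids idea in the note above (this is an approximation, not the full PAM algorithm):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
        # update step: each center becomes the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.default_rng(1).normal(m, 0.5, size=(30, 2)) for m in (0.0, 5.0)])
centers, labels = kmeans(X, k=2)

# K-medoids-style variant: replace each mean with the cluster member closest to it
medoids = np.array([X[labels == j][np.argmin(((X[labels == j] - c) ** 2).sum(axis=1))]
                    for j, c in enumerate(centers)])
print(centers, medoids)
```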