Tuesday, May 24, 2011

Data mining (2)

The data classification process includes two steps: learning and classification. In the learning step, a model is built from training items; in the classification step, the model is used to predict the class labels of new items. Because the class label of each training item is provided, this is called supervised learning. It contrasts with unsupervised learning (clustering), in which the class label of each training item is not known, and the number or set of classes to be learned may not be known in advance either.
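The two steps can be sketched with a very small classifier. The nearest-centroid method below is only an illustrative choice (it is not discussed in this post), and the function names `learn` and `classify` and the sample data are made up for the example.

```python
from collections import defaultdict


def learn(training_items):
    """Learning step: build a model (one centroid per class) from labeled items."""
    sums = {}
    counts = defaultdict(int)
    for features, label in training_items:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    # The model maps each class label to the mean of its training items.
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def classify(model, features):
    """Classification step: assign the label of the closest centroid."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


# Hypothetical labeled training items: (feature vector, class label).
training_items = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((8.0, 9.0), "high"),
    ((9.0, 8.5), "high"),
]

model = learn(training_items)             # supervised: labels are provided
prediction = classify(model, (2.0, 1.5))  # predict the label of a new item
print(prediction)
```

The labels in `training_items` are what make this supervised; a clustering method would receive only the feature vectors.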

Criteria for evaluating classification and prediction methods:
Accuracy, speed, robustness (even given noisy or missing data), scalability (can be applied to large amounts of data), and interpretability.

Backpropagation (BP) is a neural network learning algorithm. The advantages of neural networks include a high tolerance to noisy data as well as the ability to classify patterns on which they have not been trained.

A multilayer neural network includes an input layer, one or more hidden layers, and an output layer. If there is one hidden layer, so that the network has only these three layers, we call it a two-layer neural network (the input layer is not counted because it serves only to pass the input values to the next layer). If it contains two hidden layers, it is called a three-layer neural network.
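A minimal sketch of how values flow through such a network may make the layer counting concrete. It assumes a sigmoid activation and hand-picked weights, neither of which is specified in this post, and it shows only the forward pass, not backpropagation training.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """Compute one layer's outputs; weights[j][i] connects input i to unit j."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the inputs through every counted layer (hidden and output)."""
    activations = inputs  # the input layer only passes values along
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical weights: 3 input units -> 2 hidden units -> 1 output unit.
hidden = ([[0.2, -0.3, 0.4], [0.1, 0.5, -0.2]], [-0.4, 0.2])
output = ([[-0.3, 0.2]], [0.1])
network = [hidden, output]

num_layers = len(network)  # counts hidden + output layers, not the input layer
result = forward([1.0, 0.0, 1.0], network)
print(num_layers, result)
```

Here `network` holds two weight layers, which matches the "two-layer" naming: the input layer has no weights of its own.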

Before training begins, we have to decide on the network topology: the number of units in the input layer, the number of hidden layers, the number of units in each hidden layer, and the number of units in the output layer. Normalizing the input data will speed up the learning process.
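As an illustration of input normalization, here is a sketch of min-max scaling, which is one common choice; the post does not say which normalization to use, and the function name and sample values are made up for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall in the range [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values are identical, so there is no range to rescale.
        return [new_min for _ in values]
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]


# Hypothetical attribute values measured on a large scale.
incomes = [12000.0, 73600.0, 98000.0]
normalized = min_max_normalize(incomes)
print(normalized)
```

Each input attribute would be scaled separately, so that attributes with large ranges do not dominate those with small ranges.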

A support vector machine (SVM) uses a nonlinear mapping to transform the original training data into a higher dimension. Within this new dimension, it searches for the optimal linear separating hyperplane (the decision boundary that separates the items of one class from those of another). With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. SVM finds this hyperplane using support vectors (some “essential” training items) and margins (defined by the support vectors). SVM searches for the hyperplane with the largest margin, that is, the maximum marginal hyperplane (MMH). The complexity of the learned classifier is determined by the number of support vectors rather than the dimensionality of the data. Hence, SVM tends to be less prone to overfitting than some other methods.

The support vectors are the essential or critical training items: they lie closest to the decision boundary (MMH). If all the other training items were removed and the training process repeated, we would get the same separating hyperplane.

A nonlinear SVM is obtained by extending the approach for linear SVMs in two steps. First, transform the original input data into a higher-dimensional space using a nonlinear mapping. Second, search for a linear separating hyperplane in the new space. For example, a 3D input vector X=(x1,x2,x3) is mapped to a 6D space Z using the mappings \phi_1(X)=x1, \phi_2(X)=x2, \phi_3(X)=x3, \phi_4(X)=(x1)^2, \phi_5(X)=x1x2, \phi_6(X)=x1x3. A decision hyperplane in the new space is d(Z)=WZ+b, where W and Z are vectors. It is linear in Z but nonlinear in the original input X. Instead of computing the dot product on the transformed data items, it turns out to be mathematically equivalent to apply a kernel function K(Xi, Xj)=\phi(Xi).\phi(Xj) to the original input data. In other words, wherever \phi(Xi).\phi(Xj) appears in the training algorithm, we can replace it with the kernel function K(Xi, Xj). The calculations are then made in the original input space, which has much lower dimensionality.
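The mapping and the kernel idea can be sketched in a few lines. The first part below implements the 6D mapping from the example with made-up weights W and bias b. The second part illustrates the kernel trick using a different, assumed mapping: a 2D input with the homogeneous degree-2 polynomial kernel K(X, Y)=(X.Y)^2, chosen because its explicit feature map is easy to write down. It is not claimed to be the kernel matching the 6D example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """The 6D mapping from the example: 3D input X -> 6D vector Z."""
    x1, x2, x3 = x
    return [x1, x2, x3, x1 ** 2, x1 * x2, x1 * x3]


def decision(x, w, b):
    """d(Z) = W.Z + b: linear in Z, nonlinear in the original input X."""
    return dot(w, phi(x)) + b


# Hypothetical weights and bias, chosen only for illustration.
w = [1, 1, 1, 1, 1, 1]
b = 0.5
d_value = decision((1, 2, 3), w, b)


# Kernel trick illustration with an assumed 2D mapping (not the one above).
def phi_poly(x):
    """Explicit feature map for the kernel (X.Y)^2 on 2D inputs."""
    x1, x2 = x
    return [x1 ** 2, math.sqrt(2) * x1 * x2, x2 ** 2]


def kernel_poly(x, y):
    """Same value as dot(phi_poly(x), phi_poly(y)), computed in input space."""
    return dot(x, y) ** 2


xi, xj = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi_poly(xi), phi_poly(xj))  # dot product in the mapped space
via_kernel = kernel_poly(xi, xj)            # no mapping is ever computed
print(d_value, explicit, via_kernel)
```

The two values `explicit` and `via_kernel` agree up to floating-point rounding, which is the point of the kernel: the training algorithm never needs to construct the higher-dimensional vectors.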
