Classification in Data Mining
Classification assigns each item in a data set to one of a predefined set of classes or groups. As a data analysis task, it constructs a model, or classifier, that predicts categorical labels (the class label attribute). The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks. A classification task begins with a data set in which the class assignments are known.
For example, a classification model that predicts credit risk could be developed based on observed data for many loan applicants over a period of time. In addition to the historical credit rating, the data might track employment history, home ownership or rental, years of residence, number and type of investments, and so on. Credit rating would be the target, the other attributes would be the predictors, and the data for each customer would constitute a case. Classifications are discrete and do not imply order. Continuous, floating-point values would indicate a numerical, rather than a categorical, target. A predictive model with a numerical target uses a regression algorithm, not a classification algorithm.
The simplest type of classification problem is binary classification. In binary classification, the target attribute has only two possible values: for example, high credit rating or low credit rating. Multiclass targets have more than two values: for example, low, medium, high, or unknown credit rating.
In the model build (training) process, a classification algorithm finds relationships between the values of the predictors and the values of the target. Different classification algorithms use different techniques for finding relationships. These relationships are summarized in a model, which can then be applied to a different data set in which the class assignments are unknown.
Classification has many applications in customer segmentation, business modeling, marketing, credit analysis, and biomedical and drug response modeling.
Data classification is a two-step process:
- The first step constructs the classification model from a training data set.
- The second step uses the model to predict the class label of new data.
For example, credit approval can be classified on the basis of customer data.
Similarly, a university assigns a class to each student based on marks x:
- If x >= 65, then First class with distinction.
- If 60 <= x < 65, then First class.
- If 55 <= x < 60, then Second class.
- If 50 <= x < 55, then Pass class.
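The marks rules can be sketched as a small rule-based classifier. This is only an illustration: it assumes that a mark on a boundary (for example 60 or 65) belongs to the higher class, and the function name `classify_marks` and the "Unclassified" label for marks below 50 are hypothetical, since the rules give no label for that range.

```python
def classify_marks(x: float) -> str:
    """Map a mark to a class label, testing the highest threshold first."""
    if x >= 65:
        return "First class with distinction"
    if x >= 60:
        return "First class"
    if x >= 55:
        return "Second class"
    if x >= 50:
        return "Pass class"
    # Assumption: the rules do not say what happens below 50.
    return "Unclassified"


for marks in (72, 65, 60, 57, 50, 41):
    print(marks, "->", classify_marks(marks))
```

Because the thresholds are checked from highest to lowest, each mark falls into exactly one class.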
The two important steps of classification are:
1. Model construction
- A predefined class label is assigned to every sample tuple or object. These labeled tuples make up the training data set.
- The constructed model, which is based on the training set, is represented as classification rules, decision trees, or mathematical formulae.
2. Model usage
- The constructed model is used to perform classification of unknown objects.
- The known class label of each test sample is compared with the class label predicted by the model.
- The accuracy of the model is estimated as the percentage of test set samples that are correctly classified by the constructed model.
- The test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic.
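The two steps can be illustrated with a deliberately tiny model. The sketch below is not a real classification algorithm from the text: it assumes a single numeric predictor and learns one threshold halfway between the class means. The data values and the names `build_model`, `predict`, and `accuracy` are hypothetical.

```python
# Hypothetical labelled data: (years_of_employment, credit_risk)
training_set = [(1, "high"), (2, "high"), (3, "high"),
                (6, "low"), (8, "low"), (10, "low")]
# The test set is kept separate from the training set.
test_set = [(2.5, "high"), (7, "low"), (4, "low"), (9, "low")]


def build_model(training):
    """Model construction: a threshold halfway between the two class means."""
    high = [x for x, label in training if label == "high"]
    low = [x for x, label in training if label == "low"]
    return (sum(high) / len(high) + sum(low) / len(low)) / 2


def predict(threshold, x):
    """Model usage: classify an object whose label is unknown."""
    return "high" if x < threshold else "low"


def accuracy(threshold, test):
    """Percentage of test samples whose predicted label matches the known one."""
    correct = sum(1 for x, label in test if predict(threshold, x) == label)
    return 100.0 * correct / len(test)


threshold = build_model(training_set)
print("threshold:", threshold)                          # 5.0
print("accuracy: %.1f%%" % accuracy(threshold, test_set))  # 75.0%
```

One of the four test samples (4 years, labelled "low") falls on the wrong side of the threshold, so the estimated accuracy is 75%.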
Classification vs Prediction
|Classification|Prediction|
|---|---|
|It predicts categorical class labels.|It estimates the value of an attribute for a given sample.|
|For example: If the patients are grouped on the basis of their known medical data and treatment outcome, then it is considered as classification.|For example: If a classification model is used to predict the treatment outcome for a new patient, then it is prediction.|
Issues related to Classification and Prediction
1. Data preparation
Data preparation consists of data cleaning, relevance analysis, and data transformation.
2. Evaluation of classification methods
i) Predictive accuracy: The ability of the model to correctly predict the class label of new or previously unseen data.
ii) Speed and scalability: The time required to construct and use the model, and the ability to build the model efficiently on large, disk-resident databases.
iii) Interpretability: The level of understanding and insight provided by the model.
Decision Tree Induction Method
- A decision tree performs classification in the form of a tree structure. It breaks the data set down into smaller and smaller subsets while the associated decision tree is built incrementally.
- The final result is a tree with decision nodes and leaf nodes.
For example, a decision tree can be designed to decide whether an applicant is eligible or not eligible to get a driving license.
Attribute selection methods
1. Gini Index (IBM Intelligent Miner)
- The Gini index is used in CART (Classification and Regression Trees), IBM’s Intelligent Miner system, and SPRINT (Scalable Parallelizable Induction of Decision Trees).
If a data set ‘T’ contains examples from ‘n’ classes, the gini index, gini(T), is defined as:
$$gini(T) = 1 - \sum_{j=1}^{n} p_j^2$$
where p_j is the relative frequency of class j in T.
After splitting T into two subsets T1 and T2 with sizes N1 and N2, where N = N1 + N2, the gini index of the split data is defined as:
$$gini_{split}(T) = \frac{N_1}{N}\, gini(T_1) + \frac{N_2}{N}\, gini(T_2)$$
- For each attribute, each of the possible binary splits is considered, and the attribute providing the smallest gini_split(T) is chosen to split the node. For continuous-valued attributes, each possible split-point must be considered.
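As a rough sketch, the gini index can be computed as one minus the sum of squared class frequencies, and the gini index of a binary split as the size-weighted average of the two subsets, with weights N1/N and N2/N. The function names and the sample labels below are hypothetical.

```python
from collections import Counter


def gini(labels):
    """gini(T) = 1 - sum of squared relative class frequencies in T."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(t1, t2):
    """Size-weighted gini index after splitting T into subsets T1 and T2."""
    n = len(t1) + len(t2)
    return len(t1) / n * gini(t1) + len(t2) / n * gini(t2)


# Hypothetical class labels for a node with two evenly mixed classes.
t = ["yes", "yes", "yes", "no", "no", "no"]
print(gini(t))                                                # 0.5
print(gini_split(["yes", "yes", "yes"], ["no", "no", "no"]))  # 0.0
print(gini_split(["yes", "yes", "no"], ["yes", "no", "no"]))  # about 0.444
```

The split that separates the classes perfectly scores 0.0, so it would be preferred over the mixed split.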
2. ID3 (Algorithm for Inducing a Decision Tree)
- Ross Quinlan developed the ID3 algorithm around 1980.
- C4.5 is an extension of ID3 that adds the following improvements:
- It avoids overfitting the data.
- It determines how deeply to grow the decision tree and prunes it to reduce error.
- It handles continuous-valued attributes, for example salary or temperature.
- It handles training data with missing attribute values and chooses a suitable attribute selection measure.
- It improves computational efficiency.
Algorithm Generate_decision_tree: generate a decision tree from the training tuples of data partition D.
Step 1: Create a node ‘N’.
Step 2: If the tuples in D are all of the same class ‘C’, go to step 3; otherwise go to step 4.
Step 3: Return ‘N’ as a leaf node labeled with the class ‘C’.
Step 4: If attribute_list is empty, return ‘N’ as a leaf node labeled with the majority class in D.
Step 5: Apply attribute_selection_method (D, attribute_list) to find the “best” splitting criterion, and label node ‘N’ with it.
Step 6: If splitting_attribute is discrete-valued and multiway splits are allowed, go to step 7; otherwise go to step 8.
Step 7: attribute_list ← attribute_list - splitting_attribute; // remove the splitting attribute.
Step 8: For each outcome j of the splitting criterion, let Dj be the set of data tuples in D that satisfy outcome j. If Dj is empty, attach a leaf labeled with the majority class in D to node ‘N’.
Step 9: Else, attach the node returned by Generate_decision_tree (Dj, attribute_list) to node ‘N’.
Step 10: Return ‘N’.
Step 11: Stop.
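The recursive procedure can be sketched in Python as follows. This is a minimal illustration under several assumptions: all attributes are discrete-valued with multiway splits, the attribute selection method is the weighted gini index, and the tree is stored as nested dictionaries. The driving license data and every name in the code are hypothetical.

```python
from collections import Counter


def majority_class(rows):
    """Most frequent class label among the (features, label) rows."""
    return Counter(label for _, label in rows).most_common(1)[0][0]


def gini(rows):
    n = len(rows)
    counts = Counter(label for _, label in rows)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def select_attribute(rows, attributes):
    """Pick the attribute whose multiway split has the lowest weighted gini."""
    def weighted_gini(attr):
        parts = {}
        for features, label in rows:
            parts.setdefault(features[attr], []).append((features, label))
        return sum(len(p) / len(rows) * gini(p) for p in parts.values())
    return min(attributes, key=weighted_gini)


def generate_decision_tree(rows, attributes, domains):
    classes = {label for _, label in rows}
    if len(classes) == 1:            # all tuples share one class: leaf
        return classes.pop()
    if not attributes:               # attribute list is empty: majority leaf
        return majority_class(rows)
    best = select_attribute(rows, attributes)
    remaining = [a for a in attributes if a != best]  # remove split attribute
    node = {"attribute": best, "branches": {}}
    for value in domains[best]:      # one branch per outcome of the split
        subset = [(f, l) for f, l in rows if f[best] == value]
        if not subset:               # empty partition: majority class of D
            node["branches"][value] = majority_class(rows)
        else:
            node["branches"][value] = generate_decision_tree(
                subset, remaining, domains)
    return node


def classify(tree, features):
    """Follow branches until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][features[tree["attribute"]]]
    return tree


# Hypothetical training tuples for driving license eligibility.
domains = {"age_ok": ["yes", "no"], "passed_test": ["yes", "no"]}
rows = [
    ({"age_ok": "yes", "passed_test": "yes"}, "eligible"),
    ({"age_ok": "yes", "passed_test": "no"}, "not eligible"),
    ({"age_ok": "no", "passed_test": "yes"}, "not eligible"),
    ({"age_ok": "no", "passed_test": "no"}, "not eligible"),
]
tree = generate_decision_tree(rows, list(domains), domains)
print(tree)
print(classify(tree, {"age_ok": "yes", "passed_test": "yes"}))  # eligible
```

Leaves are plain class labels and internal nodes are dictionaries, which keeps the sketch short; a production implementation would also need to handle continuous attributes and unseen attribute values.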
3. Tree Pruning
- To avoid the overfitting problem, it is necessary to prune the tree.
- Overfitting generally arises in two ways while constructing a decision tree. Some records may contain noisy data, which increases the size of the decision tree. Alternatively, the number of training examples may be too small to produce a representative sample of the true target function.
- Pruning can be done in a top-down or bottom-up fashion.
Some well-known methods to perform pruning are:
1. Reduced error pruning
This is the simplest method of pruning. Starting at the leaves, each node is replaced with its most popular class. The change is kept if it does not reduce prediction accuracy.
2. Cost complexity pruning
- It generates a series of trees T0, ..., Tm.
- ‘T0’ is the initial tree and ‘Tm’ is the root alone.
- At step i, the tree Ti is created by removing a subtree from tree Ti-1 and replacing it with a leaf node whose value is chosen as in the tree-building algorithm.
- The subtree to remove is chosen as follows:
- Define the error rate of tree ‘T’ over data set ‘S’ as err(T, S).
- The subtree ‘t’ that minimizes
$$\frac{err(prune(T, t), S) - err(T, S)}{|leaves(T)| - |leaves(prune(T, t))|}$$
is chosen for removal.
- The function prune(T, t) defines the tree obtained by pruning the subtree ‘t’ from the tree ‘T’. After the series of trees has been created, the best tree is chosen by generalized accuracy, as measured on a training set or by cross-validation.
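A short sketch of how the subtree to remove might be selected. It assumes the usual cost complexity criterion, (err(prune(T, t), S) - err(T, S)) / (|leaves(T)| - |leaves(prune(T, t))|), that is, the increase in error rate per leaf removed. The error rates, leaf counts, and subtree names below are hypothetical numbers chosen for illustration, not measurements.

```python
def pruning_score(err_pruned, err_full, leaves_full, leaves_pruned):
    """Increase in error rate per leaf removed when a subtree is pruned."""
    return (err_pruned - err_full) / (leaves_full - leaves_pruned)


# Hypothetical tree T: 8 leaves, error rate 0.10 on data set S.
err_full, leaves_full = 0.10, 8

# Hypothetical candidates: error rate and leaf count after pruning each one.
candidates = {
    "t1": {"err_pruned": 0.12, "leaves_pruned": 6},
    "t2": {"err_pruned": 0.13, "leaves_pruned": 7},
    "t3": {"err_pruned": 0.16, "leaves_pruned": 4},
}

scores = {
    name: pruning_score(c["err_pruned"], err_full, leaves_full, c["leaves_pruned"])
    for name, c in candidates.items()
}
best = min(scores, key=scores.get)
print(scores)
print("prune first:", best)  # t1
```

Here t1 costs about 0.01 extra error per leaf removed, less than t3 (0.015) or t2 (0.03), so it is removed first. Repeating the step on the pruned tree yields the series of trees.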
3. Alpha-beta pruning
- It is a search algorithm that improves on the minimax algorithm by eliminating branches that cannot influence the final decision. It applies to game-tree search rather than to pruning a learned decision tree.
- Let alpha (α) be the value of the best (highest-value) choice found so far along the path for MAX.
- Let beta (β) be the value of the best (lowest-value) choice found so far along the path for MIN.
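A minimal sketch of alpha-beta pruning on a small game tree, assuming the tree is given as nested lists whose leaves are numeric payoffs for MAX. The tree values and the function name `alphabeta` are hypothetical.

```python
import math


def alphabeta(node, alpha, beta, maximizing):
    """Minimax value of a nested-list game tree, skipping hopeless branches."""
    if not isinstance(node, list):       # leaf: return its payoff
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)    # best value MAX can force so far
            if alpha >= beta:
                break                    # MIN would never allow this branch
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True))
        beta = min(beta, value)          # best value MIN can force so far
        if alpha >= beta:
            break                        # MAX already has a better option
    return value


# MAX moves at the root; each inner list is a MIN node.
tree = [[3, 5], [2, 9], [0, 1]]
print(alphabeta(tree, -math.inf, math.inf, True))  # 3
```

After the first MIN node returns 3, the leaves 9 and 1 are never examined, because the second and third MIN nodes already guarantee a value below 3.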
- While working with decision trees, the problem of missing values (values that are absent or wrong) may occur.
- One of the most common solutions is to label the missing value as blank.
- Prediction uses some variables or fields available in the data set to predict unknown values of other variables of interest.
- Numeric prediction is the task of predicting continuous or ordered values for a given input.
For example: A company may wish to predict the potential sales of a new product given its price.
- The most widely used approach for numeric prediction is regression.
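As an illustration of numeric prediction, the sketch below fits a straight line by ordinary least squares and uses it to predict sales from price. It assumes a simple linear relationship with a single predictor; the price and sales figures and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_sales(intercept, slope, price):
    """Numeric prediction: a continuous value rather than a class label."""
    return intercept + slope * price


# Hypothetical past products: price and units sold.
prices = [10.0, 20.0, 30.0, 40.0]
sales = [900.0, 700.0, 500.0, 300.0]

intercept, slope = fit_line(prices, sales)
print(intercept, slope)                        # 1100.0 -20.0
print(predict_sales(intercept, slope, 25.0))   # 600.0
```

Unlike a classifier, the model returns a continuous value, so it can produce predictions for prices that never appeared in the data.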