Data Analysis Using WEKA


Waikato Environment for Knowledge Analysis

Weka is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. This paper attempts to show the use of this software for Cluster Analysis and Decision Tree Analysis.

Prabhjot Singh Bhatia – 10BM60060, Vinod Gupta School of Management, Indian Institute of Technology Kharagpur, India.

Table of Contents

Introduction
    Cluster Analysis
        K Means Clustering
    Decision Trees
        Features
Cluster Analysis using WEKA
    About the dataset
    Steps for K Means clustering
    Interpretation
        Cluster 0
        Cluster 1
        Cluster 2
        Cluster 3
        Cluster 4
Decision trees using WEKA
    About the Dataset
    Steps for decision tree generation
    Interpretation of Output
Works Cited


Table of Figures

Screenshot 1: The data file loaded into Weka
Screenshot 2: The Visualize Tab
Screenshot 3: Selecting Simple KMeans
Screenshot 4: The Cluster Tab
Screenshot 5: KMeans Clustering Options
Screenshot 6: SimpleKMeans, 2 Clusters
Screenshot 7: Simple KMeans, Results for 2-14 clusters
Screenshot 8: Knee point for number of clusters
Screenshot 9: Visualizing 5 cluster solution
Screenshot 10: Visualization of Clusters - I
Screenshot 11: Visualization of Clusters - II
Screenshot 12: The data file loaded into Weka
Screenshot 13: Visualizing the given dataset
Screenshot 14: Selecting J48 Tree Algorithm
Screenshot 15: The Classify Tab
Screenshot 16: The Output of J48 Algorithm
Screenshot 17: The Decision Tree Generated
Screenshot 18: Dissection of the Textual Decision Tree Output


Introduction

Identifying patterns in data and making predictions based on those patterns plays a significant role in all aspects of an industry or an individual business. A plethora of methods and tools are available for this. This paper introduces two such methods, Decision Tree classification and K Means clustering, using a tool called Weka.

Cluster Analysis

As a part of exploratory data mining, cluster analysis assigns a set of objects to groups such that objects within a group are more similar to one another than to objects in other groups. As a statistical data analysis tool, cluster analysis forms the backbone of various disciplines, including marketing research, pattern finding, intelligence gathering and image analysis. Cluster analysis comprises a set of algorithms, each suited to specific situations. These include Hierarchical Clustering and K-Means clustering.

K Means Clustering

This method has the objective of partitioning a set of n objects into k clusters, based on their closeness to the cluster centers. Closeness to the cluster centers is measured with a standard distance measure, e.g. Euclidean distance.

Features

K Means clustering is generally fast to compute compared with other types of clustering algorithms. The number of clusters must be supplied as an input for the algorithm to work. This number “k” can be determined by a number of methods:

One of them uses hierarchical clustering first to determine k and then continues with K Means. However, this method is not computationally efficient and takes a long time to complete.

Another method, and the one described here, is the knee-point method (Sugar, Gareth, & James, 2003). It consists of running the K Means algorithm for k = 1 up to an arbitrary number, say 10 to 15. The distortion, or average sum of squared errors, is then plotted against the number of clusters to find a “knee” point, where the curve stops falling steeply. This gives the number of clusters.
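For reference, the quantity plotted in the knee-point method can be written out as below. This is the standard K Means objective under Euclidean distance; the symbols C_j (the set of objects assigned to cluster j) and \mu_j (the centroid of that cluster) are notation introduced here for illustration and do not appear in the paper.

```latex
\mathrm{SSE}(k) \;=\; \sum_{j=1}^{k} \; \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j \;=\; \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

Because this value typically keeps falling as k grows, one looks for the point where the decrease flattens out rather than for a minimum.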


KMeans Clustering Algorithm

The algorithm takes the number of clusters (“k”) as a mandatory input and proceeds through the following steps (Matteucci):

1. “k” points are placed into the space represented by the objects under consideration. These are the initial cluster centres, and are the group centroids.
2. Based on a distance measure, each object is assigned to the cluster whose centre is closest to it.
3. Step 2 continues until all objects have been assigned to a cluster.
4. The positions of the cluster centroids are calculated again.
5. Steps 2 to 4 are repeated until the cluster centres no longer move.
6. This gives a partitioning of the objects into distinct groups.
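The steps above can be sketched in a few lines of Python. This is a minimal illustration for purely numeric data with Euclidean distance, not Weka's SimpleKMeans implementation. The function name `k_means`, its default arguments and the toy data are assumptions made for this example.

```python
import math
import random


def k_means(points, k, max_iter=500, seed=10):
    """Cluster `points` (a list of equal-length numeric tuples) into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k of the objects as the initial cluster centres.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Steps 2-3: assign every object to its closest centre.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 5: stop once the assignments (and hence the centres) are stable.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centre
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Within-cluster sum of squared errors for the final partition.
    sse = sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))
    return centroids, assignment, sse


# Hypothetical toy data: two well-separated groups in the plane.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels, sse = k_means(data, k=2)
print(labels, round(sse, 3))
```

The returned `sse` is the kind of figure that is tabulated for each value of k when looking for the knee point. Real data sets with nominal attributes, such as the one used later in this paper, need a distance measure that handles categories, which this sketch does not attempt.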

Decision Trees

The aim of this method is to learn which criteria determine the outcome, and in what order they are applied. We start with sample data, called the training set, and generate a model from it by machine learning. We later use this model in the field to predict the outcome on the basis of the input criteria. The actual splitting of data into groups is done on the basis of rules. Most algorithms depend on a technique called recursive partitioning, so named because the splitting is repeated for each sub-group of data. Partitioning continues until further splits no longer add value. There are several algorithms for decision trees, including J48, BTree, Random Forest, etc.

Features

- Simple to understand and interpret.
- Little or no data preprocessing is required.
- Ability to handle both numerical and categorical data.
- Performs well with large data in a short time.
- Trees can get very complex and very large to visualize.
- Attributes with more levels tend to bias the tree.
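To make the idea of choosing a split concrete, the sketch below scores candidate attributes by information gain, the entropy reduction obtained by partitioning the training set on one attribute. J48 is Weka's implementation of C4.5, which uses the related gain ratio. Plain information gain is swapped in here for brevity, and it also illustrates the bias towards attributes with many levels noted in the list above. The toy training set and the names `entropy` and `information_gain` are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction of `target` obtained by splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    # Weighted average entropy of the sub-groups produced by the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical toy training set; attribute names and values are illustrative only.
training_set = [
    {"housing": "own", "job": "skilled", "class": "good"},
    {"housing": "own", "job": "unskilled", "class": "good"},
    {"housing": "rent", "job": "skilled", "class": "bad"},
    {"housing": "rent", "job": "unskilled", "class": "bad"},
]
gains = {a: information_gain(training_set, a, "class") for a in ("housing", "job")}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

Recursive partitioning applies the same scoring again inside each sub-group produced by the chosen split, and stops when no split improves the score.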


Cluster Analysis using WEKA

About the dataset

The dataset used is the German credit dataset, available from (Hofman). The data were compiled by Professor Dr. Hans Hofmann, Institut für Statistik und Ökonometrie, University of Hamburg, Germany. The dataset describes customers on the basis of a set of 20 parameters, including age, housing, number of years since being a resident, car owned, etc. The data set also contains a “class” attribute that classifies credit-worthiness. We do not include this attribute in our clustering algorithm.

Steps for K Means clustering

1. Open the file named credit-g.arff, available from the source above. The file is already pre-processed, so the data can be used as is. (Screenshot 1)
2. Click on the “Visualize” tab to have a look at the data. This shows plots of each field against every other field, colour coded by class. (Screenshot 2)
3. Click on the “Cluster” tab (Screenshot 4). Click on “Choose” and select “SimpleKMeans” (Screenshot 3). By default, Weka chooses commonly used values for its parameters (Screenshot 5). The various options are self-explanatory. One major decision variable is the number of clusters. As described above, we run the cluster analysis for a range of cluster counts to determine it.
4. Keep the number of clusters as 2, since a 1 cluster solution would not be appropriate in this case. Click OK.
5. Click on “Ignore attributes”. If the data set contains any column from other classification experiments (here, the “class” attribute), select that attribute from the list provided. This keeps the clustering process unbiased.
6. Click “Start” to start the clustering algorithm. The Weka icon at the bottom right corner keeps flipping while processing is in progress. Once the icon stops, the analysis result is displayed in the adjacent window. (Screenshot 6)
7. Click on the text box next to the “Choose” button to get the list of options. Increase the number of clusters to 3, and click OK. Repeat steps 5 and 6.
8. Similarly, repeat step 7 for cluster numbers up to 14.
9. Tabulate the “Average Within Cluster Sum of Squared Errors” (as seen in Screenshot 7) and plot the graph using a spreadsheet program (Screenshot 8).
10. The knee point shows the number of clusters in the dataset. Here the knee is at 5 clusters, so use the output of the 5 cluster run as the cluster solution.
11. Right click on the 5 cluster solution and click on “Visualize Cluster Assignments”. (Screenshot 9)


Screenshot 1: The data file loaded into Weka.

Screenshot 2: The Visualize Tab.


Screenshot 4: The Cluster Tab

Screenshot 3: Selecting Simple KMeans


Screenshot 5: KMeans Clustering Options

Screenshot 6: SimpleKMeans, 2 Clusters


Screenshot 7: Simple KMeans, Results for 2-14 clusters

Clusters    Avg. Within Cluster Sum of Squared Error
2           5365.998
3           5145.269
4           4927.793
5           4691.713
6           4613.818
7           4530.644
8           4437.524
9           4273.035
10          4202.059
11          4197.927
12          4157.734
13          4113.83
14          4037.48

[Chart: Avg. Within Cluster Sum of Squared Errors (vertical axis) plotted against No. of Clusters (horizontal axis, 2 to 14). The knee point is marked at 5 clusters (5, 4691.713078).]

Screenshot 8: Knee point for number of clusters
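The knee in the tabulated values can also be located numerically instead of by eye. The sketch below is one simple heuristic, assumed here for illustration and not taken from the paper: it picks the cluster count where the drop in error before it most exceeds the drop after it (the largest second difference). The values are copied from the table above; the name `knee_point` is hypothetical.

```python
# Error values tabulated from the Weka runs for 2 to 14 clusters.
sse_by_k = {
    2: 5365.998, 3: 5145.269, 4: 4927.793, 5: 4691.713, 6: 4613.818,
    7: 4530.644, 8: 4437.524, 9: 4273.035, 10: 4202.059, 11: 4197.927,
    12: 4157.734, 13: 4113.83, 14: 4037.48,
}


def knee_point(sse_by_k):
    """Return the k at which the decrease in error slows down the most."""
    ks = sorted(sse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = sse_by_k[prev_k] - sse_by_k[k]
        drop_after = sse_by_k[k] - sse_by_k[next_k]
        bend = drop_before - drop_after  # large when the curve flattens right after k
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


knee_k = knee_point(sse_by_k)
print(knee_k)
```

For these values the heuristic selects 5 clusters, which agrees with the knee marked on the chart. Other heuristics may disagree on noisier curves, so the plot remains a useful check.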

=== Run information ===

Scheme: weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation: german_credit
Instances: 1000
Attributes: 21
    checking_status
    duration
    credit_history
    purpose
    credit_amount
    savings_status
    employment
    installment_commitment
    personal_status
    other_parties
    residence_since
    property_magnitude
    age
    other_payment_plans
    housing
    existing_credits
    job
    num_dependents
    own_telephone
    foreign_worker
Ignored:
    class
Test mode: evaluate on training data

=== Model and evaluation on training set ===

kMeans
======

Number of iterations: 9
Within cluster sum of squared errors: 4691.713078260774
Missing values globally replaced with mean/mode

Cluster centroids:
Cluster#          Full Data      0      1      2      3      4
                     (1000)  (220)  (130)  (230)  (267)  (153)
================================================================
checking_status no checking no checking