Data Mining Lab


Name: Muhammad Sarfraz Seat: EP1850086 Section: A Course Code: 514 Course Name: Data Warehousing and Data Mining

lab 01

LAB 01 : CONDITIONS
Write an if-else statement in Python that checks whether a student is enrolled in 2 or 3 subjects together with extra certifications, and print an appropriate message for each case. One subject fee = 1000; certification fee = 700. Two subjects may be taken together with 3 certifications; if a student selects 3 subjects, then only 2 certifications can be selected.

In [6]: subjectFee = 1000
        certificationFee = 700
        noOfSubjects = 3
        noOfCertifications = 2
        if (noOfSubjects == 2 and noOfCertifications == 3):
            print('Student is enrolled in 2 subjects and 3 certifications')
        elif (noOfSubjects == 3 and noOfCertifications == 2):
            print('Student is enrolled in 3 subjects and 2 certifications')
        else:
            print('Student cannot be enrolled')

Student is enrolled in 3 subjects and 2 certifications
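The task also states the fees, which the check above does not use. As an optional extension (the totalFee variable and the fee formula below are an added illustration, not part of the original lab code), the cost of an allowed enrollment could be computed directly:

# Hypothetical extension: compute the total fee for an allowed combination
subjectFee = 1000
certificationFee = 700
noOfSubjects = 3
noOfCertifications = 2
if (noOfSubjects, noOfCertifications) in [(2, 3), (3, 2)]:
    totalFee = noOfSubjects * subjectFee + noOfCertifications * certificationFee
    print('Enrollment allowed, total fee =', totalFee)   # 3*1000 + 2*700 = 4400
else:
    print('Student cannot be enrolled')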

lab 02

LAB 02 : LOOPS
Initialize a list of 10 passwords. Print every password on a new line with the message "Your new password"; if a password exceeds 500, print "Password cannot be greater than 500" and break out of the loop.

In [2]: passwords = [121, 55, 86, 1, 147, 635, 98, 63, 453, 100]
        for p in passwords:
            if (p > 500):
                print('Password cannot be greater than 500')
                break
            else:
                print('Your new password : ', p)

Your new password :  121
Your new password :  55
Your new password :  86
Your new password :  1
Your new password :  147
Password cannot be greater than 500
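As a side note (not part of the original lab), Python's for-else construct could additionally report when the whole list passed the check, since the else branch of a loop runs only if no break occurred:

passwords = [121, 55, 86, 1, 147, 635, 98, 63, 453, 100]
for p in passwords:
    if p > 500:
        print('Password cannot be greater than 500')
        break
    print('Your new password : ', p)
else:
    # reached only when the loop finishes without hitting break
    print('All passwords accepted')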

lab 03

LAB 03 : NumPy & Pandas

In [1]: import numpy as np
        import pandas as pd

In [12]: df = pd.DataFrame(np.random.randn(4,3), index=['a','b','c','d'],
                           columns=['one','two','three'])

In [13]: df
Out[13]:
        one       two     three
a  1.968427  0.360732  0.526789
b  0.545311 -0.511318  1.771034
c -1.270482  1.454086 -0.179600
d -1.487337 -0.008176 -0.849439

In [14]: df['one']
Out[14]:
a    1.968427
b    0.545311
c   -1.270482
d   -1.487337
Name: one, dtype: float64

In [15]: df.loc['a']
Out[15]:
one      1.968427
two      0.360732
three    0.526789
Name: a, dtype: float64

In [16]: df = df.reindex(['a','b','c','d','e'])

In [17]: df
Out[17]:
        one       two     three
a  1.968427  0.360732  0.526789
b  0.545311 -0.511318  1.771034
c -1.270482  1.454086 -0.179600
d -1.487337 -0.008176 -0.849439
e       NaN       NaN       NaN

In [18]: df.fillna('0')
Out[18]:
        one          two      three
a   1.96843     0.360732   0.526789
b  0.545311    -0.511318    1.77103
c  -1.27048      1.45409    -0.1796
d  -1.48734  -0.00817578  -0.849439
e         0            0          0

In [19]: df
Out[19]:
        one       two     three
a  1.968427  0.360732  0.526789
b  0.545311 -0.511318  1.771034
c -1.270482  1.454086 -0.179600
d -1.487337 -0.008176 -0.849439
e       NaN       NaN       NaN

In [20]: df = df.fillna('0')

In [21]: df
Out[21]:
        one          two      three
a   1.96843     0.360732   0.526789
b  0.545311    -0.511318    1.77103
c  -1.27048      1.45409    -0.1796
d  -1.48734  -0.00817578  -0.849439
e         0            0          0

In [22]: df = df.reindex(columns=['one','two','three','four','fiver'])

In [23]: df
Out[23]:
        one          two      three  four  fiver
a   1.96843     0.360732   0.526789   NaN    NaN
b  0.545311    -0.511318    1.77103   NaN    NaN
c  -1.27048      1.45409    -0.1796   NaN    NaN
d  -1.48734  -0.00817578  -0.849439   NaN    NaN
e         0            0          0   NaN    NaN

In [24]: df = df.fillna(1)

In [25]: df
Out[25]:
        one          two      three  four  fiver
a   1.96843     0.360732   0.526789   1.0    1.0
b  0.545311    -0.511318    1.77103   1.0    1.0
c  -1.27048      1.45409    -0.1796   1.0    1.0
d  -1.48734  -0.00817578  -0.849439   1.0    1.0
e         0            0          0   1.0    1.0

In [29]: df = df.rename(columns={'fiver':'five'})

In [30]: df
Out[30]:
        one          two      three  four  five
a   1.96843     0.360732   0.526789   1.0   1.0
b  0.545311    -0.511318    1.77103   1.0   1.0
c  -1.27048      1.45409    -0.1796   1.0   1.0
d  -1.48734  -0.00817578  -0.849439   1.0   1.0
e         0            0          0   1.0   1.0

In [ ]: # -------------- CREATING NEW DATAFRAME ------------------ #

In [41]: data_frame = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6],"C":[7,8,9]})

In [42]: data_frame
Out[42]:
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

In [43]: data_frame = data_frame.reindex(columns=['A','B','C','D','E'])

In [44]: data_frame
Out[44]:
   A  B  C   D   E
0  1  4  7 NaN NaN
1  2  5  8 NaN NaN
2  3  6  9 NaN NaN

In [46]: for i in data_frame:
             print(data_frame[i])

0    1
1    2
2    3
Name: A, dtype: int64
0    4
1    5
2    6
Name: B, dtype: int64
0    7
1    8
2    9
Name: C, dtype: int64
0   NaN
1   NaN
2   NaN
Name: D, dtype: float64
0   NaN
1   NaN
2   NaN
Name: E, dtype: float64

In [47]: for i in data_frame:
             print(data_frame[i].isnull())

0    False
1    False
2    False
Name: A, dtype: bool
0    False
1    False
2    False
Name: B, dtype: bool
0    False
1    False
2    False
Name: C, dtype: bool
0    True
1    True
2    True
Name: D, dtype: bool
0    True
1    True
2    True
Name: E, dtype: bool
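As a quick aside (not part of the original lab), pandas can summarize the missing values per column in a single call instead of printing every boolean mask:

print(data_frame.isnull().sum())
# A    0
# B    0
# C    0
# D    3
# E    3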

lab 04

LAB 04 : Gradient Descent for Linear Regression

In [2]: import numpy as np
        import pandas as pd
        import matplotlib.pyplot as plt

In [3]: x_quad = [n/10 for n in range(0, 100)]
        y_quad = [(n-4)**2+5 for n in x_quad]
        plt.figure(figsize=(10,7))
        plt.plot(x_quad, y_quad, 'k--')
        plt.axis([0,10,0,30])
        plt.plot([1, 2, 3], [14, 9, 6], 'ro')
        plt.plot([5, 7, 8], [6, 14, 21], 'bo')
        plt.plot(4, 5, 'ko')
        plt.xlabel('x')
        plt.ylabel('f(x)')
        plt.title('Quadratic Equation')
Out[3]: Text(0.5, 1.0, 'Quadratic Equation')

In [4]: data = pd.read_csv('../ex1data1.txt', names = ['population', 'profit'])

In [5]: data
Out[5]:
    population    profit
0       6.1101  17.59200
1       5.5277   9.13020
2       8.5186  13.66200
3       7.0032  11.85400
4       5.8598   6.82330
..         ...       ...
92      5.8707   7.20290
93      5.3054   1.98690
94      8.2934   0.14454
95     13.3940   9.05510
96      5.4369   0.61705

97 rows × 2 columns

In [6]: X_df = pd.DataFrame(data.population)
        y_df = pd.DataFrame(data.profit)
        m = len(y_df)

In [7]: X_df
Out[7]:
    population
0       6.1101
1       5.5277
2       8.5186
3       7.0032
4       5.8598
..         ...
92      5.8707
93      5.3054
94      8.2934
95     13.3940
96      5.4369

97 rows × 1 columns

In [8]: plt.figure(figsize=(10,8))
        plt.plot(X_df, y_df, 'kx')
        plt.xlabel('Population of City in 10,000s')
        plt.ylabel('Profit in $10,000s')
Out[8]: Text(0, 0.5, 'Profit in $10,000s')

In [9]: iter = 1000
        alpha = 0.01

In [10]: X_df['intercept'] = 1

In [11]: X = np.array(X_df)
         y = np.array(y_df).flatten()
         theta = np.array([0, 0])

In [12]: def cost_function(X, y, theta):
             m = len(y)
             # Calculate the cost with the given parameters
             J = np.sum((X.dot(theta) - y)**2) / 2 / m
             return J

In [13]: cost_function(X, y, theta)
Out[13]: 32.072733877455676

In [14]: def gradient_descent(X, y, theta, alpha, iterations):
             cost_history = [0] * iterations
             for iteration in range(iterations):
                 print(X)
                 print(np.shape(X))
                 hypothesis = X.dot(theta)
                 loss = hypothesis - y
                 gradient = X.T.dot(loss) / m
                 theta = theta - alpha * gradient
                 cost = cost_function(X, y, theta)
                 cost_history[iteration] = cost
             return theta, cost_history

In [28]: gd = gradient_descent(X, y, theta, alpha, iter)

In [16]: print(theta)
[0 0]
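For reference, the two functions above implement the usual squared-error cost and batch gradient update for linear regression (this summary is added here for clarity and is not part of the lab handout):

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( x^{(i)} \theta - y^{(i)} \right)^2,
\qquad
\theta := \theta - \frac{\alpha}{m} X^{T} (X\theta - y)

Note that gradient_descent returns the updated parameters rather than modifying theta in place, so the fitted values live in the returned tuple (for example, theta, cost_history = gradient_descent(X, y, theta, alpha, iter)).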

In [17]: best_fit_x = np.linspace(0, 25, 20)
         best_fit_y = [theta[1] + theta[0]*xx for xx in best_fit_x]
         plt.figure(figsize=(10,6))
         plt.plot(X_df.population, y_df, '.')
         plt.plot(best_fit_x, best_fit_y, '-')
         plt.axis([0,25,-5,25])
         plt.xlabel('Population of City in 10,000s')
         plt.ylabel('Profit in $10,000s')
         plt.title('Profit vs. Population with Linear Regression Line')
         plt.show()


Search for a dataset suitable for linear regression and apply the same algorithm to it. Print the optimized parameters and visualizations and attach them in your file. Also attach the code for this part in your file.

In [18]: data = pd.read_csv('../exam_result.csv')

In [19]: data.head()
Out[19]:
    SAT   GPA
0  1714  2.40
1  1664  2.52
2  1760  2.54
3  1685  2.74
4  1693  2.83

In [20]: X_df = pd.DataFrame(data.SAT)
         y_df = pd.DataFrame(data.GPA)
         m = len(y_df)

In [21]: plt.figure(figsize=(10,8))
         plt.plot(X_df, y_df, 'kx')
         plt.xlabel('Score of SAT')
         plt.ylabel('Obtained GPA')
Out[21]: Text(0, 0.5, 'Obtained GPA')

In [22]: iter = 1000
         alpha = 0.01

In [23]: X_df['intercept'] = 1

In [24]: X = np.array(X_df)
         y = np.array(y_df).flatten()
         theta = np.array([0, 0])

In [25]: cost_function(X, y, theta)
Out[25]: 5.581691666666667

In [29]: gd = gradient_descent(X, y, theta, alpha, iter)

In [27]: best_fit_x = np.linspace(0, 5000, 20)
         best_fit_y = [theta[1] + theta[0]*xx for xx in best_fit_x]
         plt.figure(figsize=(10,6))
         plt.plot(X_df.SAT, y_df, '.')
         plt.plot(best_fit_x, best_fit_y, '-')
         plt.axis([0,5000,-1,4])
         plt.xlabel('Score of SAT')
         plt.ylabel('Obtained GPA')
         plt.title('SAT Score vs. GPA')
         plt.show()


lab 05

LAB 05 : Naive Bayes
Naive Bayes predicts the probability of each class from the attribute values, assuming the attributes are conditionally independent given the class. The algorithm is mostly used in text classification and in problems with multiple classes.
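For reference (this formula is added for clarity and is not from the lab handout), the classifier picks the class that maximizes the posterior probability under the independence assumption:

\hat{y} = \arg\max_{c} \; P(c) \prod_{j} P(x_j \mid c)

GaussianNB, used below, models each P(x_j | c) as a normal distribution fitted per class.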

In [1]: from sklearn.naive_bayes import GaussianNB
        import numpy as np

In [2]: # assigning predictor and target variables
        x = np.array([[-3,7],[1,5],[1,2],[-2,0],[2,3],[-4,0],[-1,1],[1,1],[-2,2],[2,7],[-4,1],[-2,7]])
        Y = np.array([3, 3, 3, 3, 4, 3, 3, 4, 3, 4, 4, 4])

In [8]: # Create a Gaussian classifier
        model = GaussianNB()
        # Train the model using the training sets
        model.fit(x, Y)
Out[8]: GaussianNB()

In [13]: # Predict output
         predicted = model.predict([[1,2],[3,4]])
         print(predicted)
[3 4]

Convert the "Play Tennis" example discussed in class into numeric form and initialize the x and y values based on that example. Now run the code for the new x values as discussed in class and print the output. Attach the code and output in your file.

In [18]: # 0 - Overcast
         # 1 - Sunny
         # 2 - Rainy
         X_data = np.array([[1,0],[0,1],[2,1],[1,1],[1,1],[0,1],[2,0],[2,0],
                            [1,1],[2,1],[1,0],[0,1],[0,1],[2,0]])

In [20]: Y_data = np.array([0,0,1,1,1,0,1,0,1,1,1,1,1,0])

In [23]: model = GaussianNB()
         model.fit(X_data, Y_data)
Out[23]: GaussianNB()

In [28]: predicted = model.predict([[2,0],[2,1],[2,2]])
         print(predicted)
[0 1 1]
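As an aside (not required by the lab), the numeric mapping does not have to be written by hand: sklearn's LabelEncoder can derive integer codes from the string labels. A minimal sketch on a hypothetical outlook column (the sample values below are illustrative only):

from sklearn.preprocessing import LabelEncoder

outlook = ['Sunny', 'Overcast', 'Rainy', 'Sunny']   # hypothetical raw column
encoder = LabelEncoder()
print(encoder.fit_transform(outlook))               # e.g. [2 0 1 2]
print(encoder.classes_)                             # ['Overcast' 'Rainy' 'Sunny']

The codes differ from the manual mapping above because LabelEncoder sorts the classes alphabetically, which is fine as long as the same encoding is used for training and prediction.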

lab 06

LAB 06 : Decision Tree Using Scikit-Learn

In [2]: import numpy as np
        import pandas as pd
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.metrics import accuracy_score
        from sklearn import tree

In [13]: balance_data = pd.read_csv('../balance-scale.data', sep=',', header=None)
         balance_data.head()
Out[13]:
   0  1  2  3  4
0  B  1  1  1  1
1  R  1  1  1  2
2  R  1  1  1  3
3  R  1  1  1  4
4  R  1  1  1  5

In [14]: print("Dataset Length:: ", len(balance_data))
         print("Dataset Shape:: ", balance_data.shape)
Dataset Length::  625
Dataset Shape::  (625, 5)

In [15]: X = balance_data.values[:, 1:5]
         Y = balance_data.values[:, 0]

In [18]: X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)

In [19]: clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                              max_depth=3, min_samples_leaf=5)
         clf_entropy.fit(X_train, y_train)
         print(clf_entropy)
DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5, random_state=100)

In [20]: y_pred_en = clf_entropy.predict(X_test)
         print(y_pred_en)
['R' 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'L' 'R' 'R' ...]
(prediction output abridged: the notebook prints the full array of 'L'/'R' labels for the test set)

In [21]: print("Accuracy is ", accuracy_score(y_test, y_pred_en)*100)
Accuracy is  70.74468085106383

In [22]: with open("balanceScale.txt", "w") as f:
             f = tree.export_graphviz(clf_entropy, out_file=f)

In [23]: from IPython.display import Image
         Image(filename='lab_06_1.PNG')
Out[23]: [decision tree rendering from lab_06_1.PNG]

Apply the same code to any other dataset from the UCI Machine Learning Repository and write down the outputs (accuracy, the tree, and its visualization).

In [149]: machine_data = pd.read_csv('../machine.data', header=None)

In [150]: machine_data.head()
Out[150]:
         0        1    2     3      4    5   6    7    8    9
0  adviser    32/60  125   256   6000  256  16  128  198  199
1   amdahl   470v/7   29  8000  32000   32   8   32  269  253
2   amdahl  470v/7a   29  8000  32000   32   8   32  220  253
3   amdahl  470v/7b   29  8000  32000   32   8   32  172  253
4   amdahl  470v/7c   29  8000  16000   32   8   16  132  132

In [151]: print("Dataset Length:: ", len(machine_data))
          print("Dataset Shape:: ", machine_data.shape)
Dataset Length::  209
Dataset Shape::  (209, 10)

In [152]: X = machine_data.values[:, 2:3]
          Y = machine_data.values[:, 0]

In [153]: X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)

In [155]: clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                               max_depth=3, min_samples_leaf=5)
          clf_entropy.fit(X_train, y_train)
          print(clf_entropy)
DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5, random_state=100)

In [156]: y_pred_en = clf_entropy.predict(X_test)
          print(y_pred_en)
['nas' 'amdahl' 'ibm' 'harris' 'harris' 'nas' 'nas' 'nas' 'harris' 'amdahl'
 'nas' 'nas' 'nas' 'nas' 'harris' 'amdahl' 'nas' 'harris' 'nas' 'nas'
 'harris' 'burroughs' 'harris' 'amdahl' 'harris' 'nas' 'nas' 'nas' 'ibm'
 'ibm' 'nas' 'harris' 'harris' 'nas' 'burroughs' 'nas' 'nas' 'amdahl' 'ibm'
 'ibm' 'harris' 'amdahl' 'harris' 'honeywell' 'nas' 'nas' 'harris'
 'honeywell' 'nas' 'nas' 'honeywell' 'harris' 'nas' 'harris' 'amdahl' 'nas'
 'harris' 'harris' 'harris' 'burroughs' 'nas' 'harris' 'ibm']

In [157]: with open("machine_data.txt", "w") as f:
              f = tree.export_graphviz(clf_entropy, out_file=f)

In [158]: Image(filename='lab_06_2.PNG')
Out[158]: [decision tree rendering from lab_06_2.PNG]
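As an optional alternative to exporting a Graphviz file and attaching a screenshot (an added aside, not part of the lab handout), sklearn can draw the fitted tree directly with matplotlib:

import matplotlib.pyplot as plt
from sklearn import tree

# Render the already-fitted classifier without needing Graphviz
plt.figure(figsize=(12, 6))
tree.plot_tree(clf_entropy, filled=True)
plt.show()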

lab 07

Lab 07 : Performance Metrics

In [12]: from sklearn.metrics import confusion_matrix
         from sklearn.metrics import accuracy_score
         from sklearn.metrics import classification_report
         from sklearn.metrics import roc_auc_score
         from sklearn.metrics import log_loss

In [13]: X_actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
         Y_predic = [1, 0, 1, 1, 1, 0, 1, 1, 0, 0]

In [14]: results = confusion_matrix(X_actual, Y_predic)
         print('Confusion Matrix :')
         print(results)
Confusion Matrix :
[[3 3]
 [1 3]]

In [15]: print('Accuracy Score is', accuracy_score(X_actual, Y_predic))
         print('Classification Report : ')
         print(classification_report(X_actual, Y_predic))
         print('AUC-ROC:', roc_auc_score(X_actual, Y_predic))
         print('LOGLOSS Value is', log_loss(X_actual, Y_predic))
Accuracy Score is 0.6
Classification Report :
              precision    recall  f1-score   support

           0       0.75      0.50      0.60         6
           1       0.50      0.75      0.60         4

    accuracy                           0.60        10
   macro avg       0.62      0.62      0.60        10
weighted avg       0.65      0.60      0.60        10

AUC-ROC: 0.625
LOGLOSS Value is 13.815750437193334

Why do we use performance metrics in machine learning?

Performance metrics are used to evaluate and compare different machine learning algorithms. Using performance metrics helps to justify the accuracy of your model/algorithm.
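To make the connection between the confusion matrix and the reported scores concrete (an added illustration, not part of the lab handout), the numbers above can be recomputed by hand from the matrix entries:

# Recompute the scores for the positive class from [[TN FP], [FN TP]]
TN, FP, FN, TP = 3, 3, 1, 3          # values from the confusion matrix above

accuracy  = (TP + TN) / (TP + TN + FP + FN)                  # 0.6
precision = TP / (TP + FP)                                   # 0.5
recall    = TP / (TP + FN)                                   # 0.75
f1        = 2 * precision * recall / (precision + recall)    # 0.6
print(accuracy, precision, recall, f1)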

Task: We have a confusion matrix for cancer patients that compares the predicted test results with the actual outcomes. Write code in Python to calculate the classification accuracy and the classification report for the given data.

In [638]: X_actual = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]
          # (list abridged: 165 actual labels in total, as in the notebook)

In [639]: Y_predic = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]
          # (list abridged: 165 predicted labels in total, as in the notebook)

In [640]: results = confusion_matrix(X_actual, Y_predic)
          print('Confusion Matrix :')
          print(results)
Confusion Matrix :
[[ 50  10]
 [  5 100]]

In [641]: print('Accuracy Score is', accuracy_score(X_actual, Y_predic))
          print('Classification Report : ')
          print(classification_report(X_actual, Y_predic))
Accuracy Score is 0.9090909090909091
Classification Report :
              precision    recall  f1-score   support

           0       0.91      0.83      0.87        60
           1       0.91      0.95      0.93       105

    accuracy                           0.91       165
   macro avg       0.91      0.89      0.90       165
weighted avg       0.91      0.91      0.91       165

lab 09

LAB 09 : K-Means

In [19]: import matplotlib.pyplot as plt
         import seaborn as sns; sns.set()
         import numpy as np
         from sklearn.cluster import KMeans
         from sklearn.datasets.samples_generator import make_blobs
         import warnings
         warnings.filterwarnings("ignore")
         %matplotlib inline

In [20]: X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

In [21]: plt.scatter(X[:, 0], X[:, 1], s=20)
         plt.show()

In [22]: kmeans = KMeans(n_clusters=4)
         kmeans.fit(X)
         y_kmeans = kmeans.predict(X)

In [23]: plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap='summer')
         centers = kmeans.cluster_centers_
         plt.scatter(centers[:, 0], centers[:, 1], c='blue', s=100, alpha=0.9)
         plt.show()

What is the importance of the k-means algorithm among the clustering algorithms of machine learning?

K-means clustering is a type of unsupervised learning, used when you have unlabeled data (i.e., data without defined categories or groups). The goal of the algorithm is to find groups in the data, with the number of groups represented by the variable K.

Advantages of k-means:
- Relatively simple to implement.
- Scales to large data sets.
- Guarantees convergence.
- Can warm-start the positions of centroids.
- Easily adapts to new examples.
- Generalizes to clusters of different shapes and sizes, such as elliptical clusters.

Limitations:
- K has to be chosen manually (see the sketch below).
- The result depends on the initial centroid values.
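Since K has to be chosen manually, a common heuristic is the elbow method: fit k-means for several values of K and look for the bend in the inertia curve. A minimal sketch on the same blobs data (this snippet is an added illustration, not from the lab handout):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# inertia_ is the within-cluster sum of squared distances after fitting
inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, 'o-')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia')
plt.title('Elbow method for choosing K')
plt.show()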

Write a code snippet in Python to perform the k-means algorithm on a data set: create 10 clusters, calculate the centroids of the data, and visualize them.

In [40]: X, y_true = make_blobs(n_samples=1000, centers=10, cluster_std=1.5, random_state=0)

In [47]: plt.scatter(X[:, 0], X[:, 1], s=5)
         plt.show()

In [48]: kmeans = KMeans(n_clusters=10)
         kmeans.fit(X)
         y_kmeans = kmeans.predict(X)

In [52]: plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=5, cmap='summer')
         centers = kmeans.cluster_centers_
         plt.scatter(centers[:, 0], centers[:, 1], c='blue', s=50, alpha=1)
         plt.show()

lab 10

Lab 10 : Hierarchical Clustering

In [15]: import pandas as pd
         import matplotlib.pyplot as plt
         from sklearn.preprocessing import normalize
         from sklearn.cluster import AgglomerativeClustering
         import scipy.cluster.hierarchy as shc
         %matplotlib inline

In [16]: data = pd.read_csv('../Wholesale customers data.csv')
         data.head()
Out[16]:
   Channel  Region  Fresh  Milk  Grocery  Frozen  Detergents_Paper  Delicassen
0        2       3  12669  9656     7561     214              2674        1338
1        2       3   7057  9810     9568    1762              3293        1776
2        2       3   6353  8808     7684    2405              3516        7844
3        1       3  13265  1196     4221    6404               507        1788
4        2       3  22615  5410     7198    3915              1777        5185

In [17]: data_scaled = normalize(data)
         data_scaled = pd.DataFrame(data_scaled, columns=data.columns)
         data_scaled.head()
Out[17]:
    Channel    Region     Fresh      Milk   Grocery    Frozen  Detergents_Paper  Delicassen
0  0.000112  0.000168  0.708333  0.539874  0.422741  0.011965          0.149505    0.074809
1  0.000125  0.000188  0.442198  0.614704  0.599540  0.110409          0.206342    0.111286
2  0.000125  0.000187  0.396552  0.549792  0.479632  0.150119          0.219467    0.489619
3  0.000065  0.000194  0.856837  0.077254  0.272650  0.413659          0.032749    0.115494
4  0.000079  0.000119  0.895416  0.214203  0.284997  0.155010          0.070358    0.205294

In [18]: plt.figure(figsize=(10, 7))
         plt.title("Dendrograms")
         dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))

In [19]: plt.figure(figsize=(10, 7))
         plt.title("Dendrograms")
         dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))
         plt.axhline(y=6, color='r', linestyle='--')
Out[19]: [dendrogram with a horizontal cut line at height 6]

In [20]: cluster = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward')
         cluster.fit_predict(data_scaled)
         plt.figure(figsize=(10, 7))
         plt.scatter(data_scaled['Milk'], data_scaled['Grocery'], c=cluster.labels_)
Out[20]: [scatter plot of Milk vs. Grocery colored by cluster label]

Why does hierarchical clustering have importance over other algorithms?

The advantage of hierarchical clustering is that it is easy to understand and implement, and it does not require the number of clusters to be fixed in advance. The dendrogram output of the algorithm can be used to understand the big picture as well as the groups in your data.
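To turn the dendrogram cut (the red line at height 6 above) into flat cluster labels, SciPy's fcluster can be applied to the same linkage matrix. A minimal sketch (added as an illustration, not part of the lab handout):

import scipy.cluster.hierarchy as shc

# Build the linkage once, then cut it at the same height as the red line
linked = shc.linkage(data_scaled, method='ward')
labels = shc.fcluster(linked, t=6, criterion='distance')
print(len(set(labels)), 'clusters at cut height 6')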

lab 12

LAB 12 : PCA

In [2]: import pandas as pd
        import matplotlib.pyplot as plt
        from sklearn.preprocessing import LabelEncoder, StandardScaler
        from sklearn.decomposition import PCA
        from sklearn.model_selection import train_test_split
        # import warnings
        # warnings.filterwarnings("ignore")

In [3]: m_data = pd.read_csv('../mushrooms.csv')

In [4]: m_data.head()
Out[4]:
  class cap-shape cap-surface cap-color bruises odor gill-attachment gill-spacing gill-size gill-color ...
0     p         x           s         n       t    p               f            c         n          k ...
1     e         x           s         y       t    a               f            c         b          k ...
2     e         b           s         w       t    l               f            c         b          n ...
3     p         x           y         w       t    p               f            c         n          n ...
4     e         x           s         g       f    n               f            w         b          k ...

5 rows × 23 columns

In [5]: encoder = LabelEncoder()
        # Now apply the transformation to all the columns:
        for col in m_data.columns:
            m_data[col] = encoder.fit_transform(m_data[col])
        X_features = m_data.iloc[:, 1:23]
        y_label = m_data.iloc[:, 0]

In [6]: scaler = StandardScaler()
        X_features = scaler.fit_transform(X_features)

In [7]: # Visualize
        pca = PCA()
        pca.fit_transform(X_features)
        pca_variance = pca.explained_variance_

        plt.figure(figsize=(8, 6))
        plt.bar(range(22), pca_variance, alpha=0.5, align='center', label='individual variance')
        plt.legend()
        plt.ylabel('Variance ratio')
        plt.xlabel('Principal components')
        plt.show()

In [8]: pca2 = PCA(n_components=17)
        pca2.fit(X_features)
        x_3d = pca2.transform(X_features)
        plt.figure(figsize=(8,6))
        plt.scatter(x_3d[:,0], x_3d[:,5], c=m_data['class'])
        plt.show()
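A common follow-up (added here as an aside, not part of the lab handout) is to use explained_variance_ratio_ and its cumulative sum to decide how many components to keep:

import numpy as np

# Fraction of total variance captured by each component, and cumulative coverage
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)
n_components_95 = np.argmax(cumulative >= 0.95) + 1
print('Components needed for 95% of the variance:', n_components_95)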

What is the difference between supervised and unsupervised dimensionality reduction?

In a supervised learning model, the algorithm learns on a labeled dataset, which provides an answer key the algorithm can use to evaluate its accuracy on the training data. An unsupervised model, in contrast, is given unlabeled data that it tries to make sense of by extracting features and patterns on its own. Correspondingly, PCA is an unsupervised reduction (it ignores the class labels), while a supervised method such as LDA uses the labels to choose its projection directions; a small sketch contrasting the two follows below.
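A minimal sketch contrasting the two on the mushroom features (assuming X_features and y_label from above; this comparison is an added illustration, not part of the lab handout):

from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Unsupervised: PCA looks only at the feature variance
x_pca = PCA(n_components=2).fit_transform(X_features)

# Supervised: LDA uses the class labels to find discriminative directions
x_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X_features, y_label)

print(x_pca.shape, x_lda.shape)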

Write a PCA implementation over the dataset at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic) and attach it in your lab file.

In [11]: m_data = pd.read_csv('../breast-cancer-wisconsin.data', header=None)

In [12]: m_data.head()
Out[12]:
         0  1  2  3  4  5   6  7  8  9  10
0  1000025  5  1  1  1  2   1  3  1  1   2
1  1002945  5  4  4  5  7  10  3  2  1   2
2  1015425  3  1  1  1  2   2  3  1  1   2
3  1016277  6  8  8  1  3   4  3  7  1   2
4  1017023  4  1  1  3  2   1  3  1  1   2

In [13]: encoder = LabelEncoder()
         # Now apply the transformation to all the columns:
         for col in m_data.columns:
             m_data[col] = encoder.fit_transform(m_data[col])
         X_features = m_data.iloc[:, 1:23]
         y_label = m_data.iloc[:, 0]

In [16]: scaler = StandardScaler()
         X_features = scaler.fit_transform(X_features)

In [24]: # Visualize
         pca = PCA()
         pca.fit_transform(X_features)
         pca_variance = pca.explained_variance_

         plt.figure(figsize=(8, 6))
         plt.bar(range(10), pca_variance, alpha=0.5, align='center', label='individual variance')
         plt.legend()
         plt.ylabel('Variance ratio')
         plt.xlabel('Principal components')
         plt.show()

In [33]: pca2 = PCA(n_components=10)
         pca2.fit(X_features)
         x_3d = pca2.transform(X_features)
         plt.figure(figsize=(8,6))
         plt.scatter(x_3d[:,0], x_3d[:,5], c=m_data[0])
         plt.show()
