Classification is a machine learning task that identifies the class to which an instance belongs.
Here, we study the observed health and medical records of horses. Two classification algorithms are applied to the data to predict whether a horse survives, and their accuracies are compared; the Random Forest algorithm is found to be more accurate than the Decision Tree algorithm.
Import Libraries
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Import Data
animals = pd.read_csv('/content/DataHorse.csv')
animals.head()
surgery | age | hospital_number | rectal_temp | pulse | respiratory_rate | temp_of_extremities | peripheral_pulse | mucous_membrane | capillary_refill_time | pain | peristalsis | abdominal_distention | nasogastric_tube | nasogastric_reflux | nasogastric_reflux_ph | rectal_exam_feces | abdomen | packed_cell_volume | total_protein | abdomo_appearance | abdomo_protein | outcome | surgical_lesion | lesion_1 | lesion_2 | lesion_3 | cp_data | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | no | adult | 530101 | 38.5 | 66.0 | 28.0 | cool | reduced | NaN | more_3_sec | extreme_pain | absent | severe | NaN | NaN | NaN | decreased | distend_large | 45.0 | 8.4 | NaN | NaN | died | no | 11300 | 0 | 0 | no |
1 | yes | adult | 534817 | 39.2 | 88.0 | 20.0 | NaN | NaN | pale_cyanotic | less_3_sec | mild_pain | absent | slight | NaN | NaN | NaN | absent | other | 50.0 | 85.0 | cloudy | 2.0 | euthanized | no | 2208 | 0 | 0 | no |
2 | no | adult | 530334 | 38.3 | 40.0 | 24.0 | normal | normal | pale_pink | less_3_sec | mild_pain | hypomotile | none | NaN | NaN | NaN | normal | normal | 33.0 | 6.7 | NaN | NaN | lived | no | 0 | 0 | 0 | yes |
3 | yes | young | 5290409 | 39.1 | 164.0 | 84.0 | cold | normal | dark_cyanotic | more_3_sec | depressed | absent | severe | none | less_1_liter | 5.0 | decreased | NaN | 48.0 | 7.2 | serosanguious | 5.3 | died | yes | 2208 | 0 | 0 | yes |
4 | no | adult | 530255 | 37.3 | 104.0 | 35.0 | NaN | NaN | dark_cyanotic | more_3_sec | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 74.0 | 7.4 | NaN | NaN | died | no | 4300 | 0 | 0 | no |
The .csv file has an 'outcome' column, which is the target of this study.
target=animals['outcome']
target.head()
0          died
1    euthanized
2         lived
3          died
4          died
Name: outcome, dtype: object
target.unique()
array(['died', 'euthanized', 'lived'], dtype=object)
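Before modeling, it helps to check how balanced the three classes are, since a skewed distribution makes raw accuracy a weaker yardstick. A minimal sketch, using a stand-in Series (the real counts come from DataHorse.csv):

```python
import pandas as pd

# Stand-in for the real 'outcome' column; the actual distribution
# must be read from the loaded DataFrame.
target = pd.Series(['died', 'euthanized', 'lived', 'died', 'died', 'lived'])

# value_counts shows how many instances fall in each class.
print(target.value_counts())
```

If one class dominates, accuracy alone can look good even for a classifier that mostly predicts the majority class.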
Let us drop 'outcome' from the DataFrame.
animals = animals.drop(['outcome'],axis=1)
animals.head()
surgery | age | hospital_number | rectal_temp | pulse | respiratory_rate | temp_of_extremities | peripheral_pulse | mucous_membrane | capillary_refill_time | pain | peristalsis | abdominal_distention | nasogastric_tube | nasogastric_reflux | nasogastric_reflux_ph | rectal_exam_feces | abdomen | packed_cell_volume | total_protein | abdomo_appearance | abdomo_protein | surgical_lesion | lesion_1 | lesion_2 | lesion_3 | cp_data | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | no | adult | 530101 | 38.5 | 66.0 | 28.0 | cool | reduced | NaN | more_3_sec | extreme_pain | absent | severe | NaN | NaN | NaN | decreased | distend_large | 45.0 | 8.4 | NaN | NaN | no | 11300 | 0 | 0 | no |
1 | yes | adult | 534817 | 39.2 | 88.0 | 20.0 | NaN | NaN | pale_cyanotic | less_3_sec | mild_pain | absent | slight | NaN | NaN | NaN | absent | other | 50.0 | 85.0 | cloudy | 2.0 | no | 2208 | 0 | 0 | no |
2 | no | adult | 530334 | 38.3 | 40.0 | 24.0 | normal | normal | pale_pink | less_3_sec | mild_pain | hypomotile | none | NaN | NaN | NaN | normal | normal | 33.0 | 6.7 | NaN | NaN | no | 0 | 0 | 0 | yes |
3 | yes | young | 5290409 | 39.1 | 164.0 | 84.0 | cold | normal | dark_cyanotic | more_3_sec | depressed | absent | severe | none | less_1_liter | 5.0 | decreased | NaN | 48.0 | 7.2 | serosanguious | 5.3 | yes | 2208 | 0 | 0 | yes |
4 | no | adult | 530255 | 37.3 | 104.0 | 35.0 | NaN | NaN | dark_cyanotic | more_3_sec | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 74.0 | 7.4 | NaN | NaN | no | 4300 | 0 | 0 | no |
Collect the categorical variables into a list and encode them.
category_variables =['surgery','age','temp_of_extremities','peripheral_pulse',
'mucous_membrane','capillary_refill_time','pain',
'peristalsis','abdominal_distention','nasogastric_tube',
'nasogastric_reflux','rectal_exam_feces','abdomen',
'abdomo_appearance','abdomo_protein','surgical_lesion',
'lesion_1','lesion_2','lesion_3','cp_data']
for category in category_variables:
    # get_dummies returns one indicator column per level; keeping only the
    # first column reduces each variable to a single 0/1 flag, which is a
    # lossy simplification (this matches the encoded table shown below).
    animals[category] = pd.get_dummies(animals[category]).iloc[:, 0]
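For reference, a full one-hot encoding keeps every indicator column instead of collapsing each variable to a single flag. A minimal sketch on a toy frame (column names are illustrative):

```python
import pandas as pd

# Toy frame standing in for the horse data.
df = pd.DataFrame({
    'surgery': ['no', 'yes', 'no'],
    'pain': ['extreme_pain', 'mild_pain', 'mild_pain'],
    'pulse': [66.0, 88.0, 40.0],
})

# Passing columns= expands each listed categorical column into one indicator
# column per level, leaving numeric columns untouched.
encoded = pd.get_dummies(df, columns=['surgery', 'pain'])
print(encoded.columns.tolist())
```

This preserves all category information, at the cost of a wider feature matrix.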
Segregate the features (animals) and the target separately. Now, let's split the data.
from sklearn.preprocessing import LabelEncoder
X,y = animals.values, target.values
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)
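As a side note on the two steps above: `LabelEncoder` assigns integer codes in sorted order, and passing `stratify=y` to `train_test_split` keeps the class proportions the same in the train and test sets, which is useful when classes are imbalanced. A minimal sketch on toy labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Toy labels standing in for the 'outcome' column: 20 of each class.
labels = np.array(['died', 'euthanized', 'lived'] * 20)
X = np.arange(len(labels)).reshape(-1, 1)

# Codes follow sorted order: died=0, euthanized=1, lived=2.
le = LabelEncoder()
y = le.fit_transform(labels)
print(le.classes_)

# stratify=y preserves the 1:1:1 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)
print(np.bincount(y_te))  # → [4 4 4]
```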
Check the shape of the training data, then handle missing values with an imputer.
print(X_train.shape)
(239, 27)
# np.nan (lowercase) is the canonical spelling; np.NaN was removed in NumPy 2.0.
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
animals.head()
surgery | age | hospital_number | rectal_temp | pulse | respiratory_rate | temp_of_extremities | peripheral_pulse | mucous_membrane | capillary_refill_time | pain | peristalsis | abdominal_distention | nasogastric_tube | nasogastric_reflux | nasogastric_reflux_ph | rectal_exam_feces | abdomen | packed_cell_volume | total_protein | abdomo_appearance | abdomo_protein | surgical_lesion | lesion_1 | lesion_2 | lesion_3 | cp_data | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 530101 | 38.5 | 66.0 | 28.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | NaN | 0 | 1 | 45.0 | 8.4 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
1 | 0 | 1 | 534817 | 39.2 | 88.0 | 20.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | NaN | 1 | 0 | 50.0 | 85.0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
2 | 1 | 1 | 530334 | 38.3 | 40.0 | 24.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | 0 | 0 | 33.0 | 6.7 | 0 | 0 | 1 | 1 | 1 | 1 | 0 |
3 | 0 | 0 | 5290409 | 39.1 | 164.0 | 84.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 5.0 | 0 | 0 | 48.0 | 7.2 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
4 | 1 | 1 | 530255 | 37.3 | 104.0 | 35.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | 0 | 0 | 74.0 | 7.4 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
X_train =imp.fit_transform(X_train)
print(X_train.shape)
(239, 27)
# Use transform (not fit_transform) so the test set is filled with the
# statistics learned from the training data, avoiding test-set leakage.
X_test = imp.transform(X_test)
print(X_test.shape)
(60, 27)
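The fit-on-train, transform-on-test pattern used above can be sketched in isolation on a toy matrix:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrices standing in for X_train / X_test.
X_train = np.array([[1.0, np.nan],
                    [2.0, 3.0],
                    [1.0, 3.0]])
X_test = np.array([[np.nan, np.nan]])

imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp.fit(X_train)                        # learn the most frequent value per column
X_test_filled = imp.transform(X_test)   # reuse the training statistics
print(X_test_filled)                    # → [[1. 3.]]
```

Column 0's most frequent training value is 1.0 and column 1's is 3.0, so the all-NaN test row is filled from training data only.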
The data has been fully preprocessed, so let us apply the classifier algorithms, starting with the Decision Tree classifier.
classifier = DecisionTreeClassifier()
classifier.fit(X_train,y_train)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=None, splitter='best')
Predict on the test data.
y_predict = classifier.predict(X_test)
print(y_predict.shape)
(60,)
Let's check the accuracy of the algorithm.
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_predict)  # (y_true, y_pred) order
print(accuracy)
0.65
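With three classes, a single accuracy number hides per-class behavior; a confusion matrix and classification report show which outcomes the model confuses. A sketch with illustrative arrays (the real ones come from the fitted classifier):

```python
from sklearn.metrics import confusion_matrix, classification_report

# Illustrative true/predicted labels (0=died, 1=euthanized, 2=lived).
y_test = [0, 0, 1, 2, 2, 2]
y_predict = [0, 1, 1, 2, 2, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_predict)
print(cm)
print(classification_report(y_test, y_predict))
```

Here, for example, one true 'died' case is misread as 'euthanized' and one 'lived' case as 'died', detail that the 0.65 headline number cannot show.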
Let's try another algorithm: Random Forest.
# Note the consistent lowercase name: assigning to 'Classifier' (capital C)
# while fitting 'classifier' would silently refit and score the Decision Tree.
classifier = RandomForestClassifier()
classifier.fit(X_train, y_train)
y_predict = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_predict)
print(accuracy)
0.7
From the above, the Random Forest classifier is more accurate than the Decision Tree classifier for this model.
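One caveat: a single 80/20 split can favor either model by chance. Cross-validation averages over several splits and gives a more robust comparison. A minimal sketch on synthetic data standing in for the preprocessed horse features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class data; shape mirrors the 27-feature horse matrix.
X, y = make_classification(n_samples=300, n_features=27, n_informative=10,
                           n_classes=3, random_state=1)

for model in (DecisionTreeClassifier(random_state=1),
              RandomForestClassifier(random_state=1)):
    # cv=5 reports accuracy averaged over five train/test splits.
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, round(scores.mean(), 3))
```

Comparing the cross-validated means is a fairer test of whether the Random Forest's edge holds up beyond one particular split.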