Recognizing Handwritten Digits with scikit-learn
Hypothesis to be tested:
The scikit-learn library ships with several datasets, including the Digits dataset, that are useful for testing data-analysis and prediction problems. Some scientists claim that a classifier trained on such data predicts the digit accurately 95% of the time. The goal here is to perform data analysis to accept or reject this hypothesis.
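As a quick baseline, the claim can be checked directly on scikit-learn's built-in Digits dataset. The following is a minimal sketch of my own (the 50/50 split and random_state are illustrative choices, not part of the original analysis):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# load the built-in Digits dataset: 1797 samples of 8x8 images (64 features each)
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

# train a decision tree and measure accuracy on the held-out half
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
print('Accuracy:', accuracy_score(y_test, clf.predict(X_test)) * 100)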
I took the handwritten digit dataset from Digit Recognizer | Kaggle. Each image is 28 pixels high and 28 pixels wide, for a total of 784 pixels. Each pixel has a single pixel value indicating its lightness or darkness, with higher numbers meaning darker. This pixel value is an integer between 0 and 255, inclusive.
The training data set (train.csv) has 785 columns. The first column, called "label", is the digit that was drawn by the user. The rest of the columns contain the pixel values of the associated image.
Command: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42000 entries, 0 to 41999
Columns: 785 entries, label to pixel783
dtypes: int64(785)
memory usage: 251.5 MB
Training & Testing Datasets
Classifiers
I wanted to test various classifiers. Initially I tested a Decision Tree classifier, trained it on the first 21,000 samples, and tested it on images drawn from the remaining samples (rows 21000 to 41999); most of the time it predicted correctly. A sketch of how several classifiers could be compared on this split follows below.
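For comparison, other classifiers such as Random Forest and k-Nearest Neighbors could be evaluated on the same 21000/21000 split. This is a sketch of my own, assuming the same train.csv layout (the post reports results only for the Decision Tree):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv('train.csv').to_numpy()
X_train, y_train = data[:21000, 1:], data[:21000, 0]
X_test, y_test = data[21000:, 1:], data[21000:, 0]

# fit each classifier on the same training half and score it on the test half
for clf in (DecisionTreeClassifier(), RandomForestClassifier(), KNeighborsClassifier()):
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test)) * 100
    print(clf.__class__.__name__, acc)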
Testing
import matplotlib.pyplot as pt
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv('train.csv').to_numpy()
clf = DecisionTreeClassifier()

# training data set: first 21000 rows; column 0 is the label, the rest are pixels
xtrain = data[0:21000, 1:]
train_label = data[0:21000, 0]
clf.fit(xtrain, train_label)

# testing data (Rakesh Santhapuri): remaining 21000 rows
xtest = data[21000:, 1:]
actual_label = data[21000:, 0]

# reshape one 784-pixel row into a 28x28 image; 255 - d inverts the
# values so the digit appears dark on a light background
d = xtest[8].reshape(28, 28)
pt.imshow(255 - d, cmap='gray')
print(clf.predict([xtest[8]]))
pt.show()
Accuracy
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv('train.csv').to_numpy()
clf = DecisionTreeClassifier()

# training data set: first 21000 rows
xtrain = data[0:21000, 1:]
train_label = data[0:21000, 0]
clf.fit(xtrain, train_label)

# testing data (Rakesh Santhapuri): remaining 21000 rows
xtest = data[21000:, 1:]
actual_label = data[21000:, 0]

# predict every test image and count how many predictions match the labels
p = clf.predict(xtest)
count = 0
for i in range(0, 21000):
    if p[i] == actual_label[i]:
        count += 1
print('Accuracy=', (count / 21000) * 100)
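The same figure can be computed in one line with scikit-learn's built-in metric (an equivalent alternative to the loop above, not the original code):

from sklearn.metrics import accuracy_score
print('Accuracy=', accuracy_score(actual_label, p) * 100)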
Output of the code snippet:
Accuracy= 83.45238095238095
Since the Decision Tree's measured accuracy of about 83.45% falls well short of the claimed 95%, the hypothesis is rejected for this classifier.