Recognizing Handwritten Digits with scikit-learn

Hypothesis to be tested:

The scikit-learn library provides several datasets that are useful for testing data analysis and prediction problems, including the Digits dataset. Some scientists claim that it can be used to predict the digit correctly 95% of the time. Perform a data analysis to accept or reject this hypothesis.

I took the handwritten digit dataset from Digit Recognizer | Kaggle. Each image is 28 pixels high and 28 pixels wide, for 784 pixels in total. Each pixel has a single value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel value is an integer between 0 and 255, inclusive.

The training data set (train.csv) has 785 columns. The first column, called "label", is the digit that was drawn by the user. The remaining 784 columns contain the pixel values of the associated image.
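To make the column layout concrete, here is a minimal sketch of how one row separates into a label and a 28x28 image. The row is a made-up stand-in (random pixel values and an arbitrary label of 7), not data from train.csv, and the row-by-row pixel order is an assumption that matches the reshape used later in this post.

```python
import numpy as np

# Hypothetical stand-in for one row of train.csv:
# a label followed by 784 pixel values in the range 0..255.
rng = np.random.default_rng(0)
row = np.concatenate(([7], rng.integers(0, 256, size=784)))

label = row[0]                    # first column: the digit that was drawn
image = row[1:].reshape(28, 28)   # remaining 784 columns: one value per pixel

print(label, image.shape)
```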

Command: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42000 entries, 0 to 41999
Columns: 785 entries, label to pixel783
dtypes: int64(785)
memory usage: 251.5 MB


Training & Testing Datasets


The output shows 42,000 rows in total, one per handwritten image. I want to divide the data into two halves:

1. Training dataset: rows 0 to 20,999 (data[0:21000])
2. Testing dataset: rows 21,000 to 41,999 (data[21000:])
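The split above can be sketched as a small helper. The function name split_in_half and the 10-row demo array are hypothetical and used only for illustration. Note that splitting by position like this assumes the rows are not sorted by label.

```python
import numpy as np

def split_in_half(data):
    """Split a (rows, 785) array into training and testing features and labels."""
    half = len(data) // 2
    x_train, y_train = data[:half, 1:], data[:half, 0]
    x_test, y_test = data[half:, 1:], data[half:, 0]
    return x_train, y_train, x_test, y_test

# Hypothetical stand-in with the same column layout as train.csv but only 10 rows.
demo = np.arange(10 * 785).reshape(10, 785)
x_train, y_train, x_test, y_test = split_in_half(demo)
print(x_train.shape, x_test.shape)
```

With the real 42,000-row array, the same call would give 21,000 training rows and 21,000 testing rows.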

Classifiers

I want to try various classifiers. Initially I used a decision tree classifier, trained it on the first 21,000 rows, and tested it on various images from the testing rows (21,000 to 41,999). Most of the time it predicted correctly.
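As a rough sketch of what trying various classifiers could look like, the snippet below fits three scikit-learn classifiers on the same half-and-half split and reports each one's accuracy. It makes two assumptions that are not from this post: it uses scikit-learn's bundled 8x8 digits (load_digits) as a stand-in so that it runs without train.csv, and the random forest and k-nearest neighbours models are simply examples of alternatives, left at their default settings.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: scikit-learn's bundled 8x8 digits, not the Kaggle train.csv.
X, y = load_digits(return_X_y=True)
half = len(X) // 2
x_train, y_train = X[:half], y[:half]
x_test, y_test = X[half:], y[half:]

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "k-nearest neighbours": KNeighborsClassifier(),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(x_train, y_train)
    scores[name] = clf.score(x_test, y_test)  # fraction of correct predictions
    print(f"{name}: {scores[name] * 100:.1f}%")
```

The numbers from this stand-in dataset are not comparable to the Kaggle results in this post. The point is only the pattern of fitting several models on the same split.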

Testing

Let's see how the classifier labels a particular image. We print the predicted label and display the image, so the two can be compared by eye.
See the snippet below:
import matplotlib.pyplot as pt
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv('train.csv').to_numpy()
clf = DecisionTreeClassifier()

# training data set: first 21,000 rows (column 0 is the label)
xtrain = data[0:21000,1:]
train_label = data[0:21000,0]

clf.fit(xtrain,train_label)

# testing data (Rakesh Santhapuri)
xtest = data[21000:,1:]
actual_label = data[21000:,0]

# show test image 8 as a 28x28 picture and print its predicted label
d = xtest[8].reshape(28,28)
pt.imshow(255-d, cmap = 'gray')
print(clf.predict([xtest[8]]))
pt.show()
Output (the label predicted by the classifier for the snippet above): 3


[Image: the actual handwritten image that was tested]

Accuracy

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv('train.csv').to_numpy()
clf = DecisionTreeClassifier()

# training data set: first 21,000 rows (column 0 is the label)
xtrain = data[0:21000,1:]
train_label = data[0:21000,0]
clf.fit(xtrain,train_label)

# testing data (Rakesh Santhapuri)
xtest = data[21000:,1:]
actual_label = data[21000:,0]

# predict every test image and count how many predictions match the actual label
p = clf.predict(xtest)
count = 0
for i in range(len(actual_label)):
    if p[i] == actual_label[i]:
        count += 1
print('Accuracy=',(count/len(actual_label))*100)

Output:

Accuracy= 83.45238095238095
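The counting loop can also be written as a single NumPy comparison. This is a minimal sketch, not the code that produced the figure above, and the helper name accuracy_percent is hypothetical.

```python
import numpy as np

def accuracy_percent(predicted, actual):
    """Percentage of positions where the predicted label equals the actual label."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return float(np.mean(predicted == actual) * 100)

# Tiny made-up example: 3 of the 4 labels match.
print(accuracy_percent([3, 1, 4, 1], [3, 1, 4, 2]))  # prints 75.0
```

With the variables from the snippet above, the call would be accuracy_percent(p, actual_label). scikit-learn's sklearn.metrics.accuracy_score does the same comparison but returns a fraction between 0 and 1 rather than a percentage.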


Epilogue

More advanced classifiers exist. Even with a basic decision tree I got an accuracy of about 83.5%, which is below the claimed 95%, so this classifier on its own does not support the hypothesis. There is a good chance that a properly tuned, more advanced classifier would exceed 95%.
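One way to make the accept-or-reject decision explicit is a one-proportion z-test of the measured accuracy against the claimed 95%. This is a sketch under assumptions that are not from this post: the 21,000 test images are treated as independent samples, and the function name z_score_against_claim is hypothetical. It only judges the untuned decision tree, not the other classifiers mentioned above.

```python
import math

def z_score_against_claim(observed_accuracy, claimed_accuracy, n_samples):
    """One-proportion z statistic: how far the observed accuracy is from the claim."""
    standard_error = math.sqrt(claimed_accuracy * (1 - claimed_accuracy) / n_samples)
    return (observed_accuracy - claimed_accuracy) / standard_error

# Figures from this post: about 83.45% observed on 21,000 test images, 95% claimed.
z = z_score_against_claim(0.8345238095238095, 0.95, 21000)
print(round(z, 1))
```

A z value far below about -1.64 (the one-sided 5% threshold) means the measured accuracy is significantly lower than the claim, so for this particular classifier the 95% hypothesis would be rejected.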
