Optical Character Recognition of Jutakshars within Devanagari Script

Optical Character Recognition of Jutakshars within Devanagari Script
Sheallika Singh, Shreesh Ladha
Supervised by: Dr. Harish Karnick, Dr. Amit Mitra
UGP Presentation, 10 April 2016

Basic working of OCR
- Converts printed/scanned text into editable text
- Starts off with some basic preprocessing of the image for uniformity
- Segmentation of lines, words and characters based on their horizontal and vertical histograms of pixel intensities
- Classifiers are then used for prediction
- Composition using language rules and n-gram models
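The histogram-based segmentation step can be illustrated with a minimal sketch. It is not the authors' code: it assumes a binarized image stored as nested lists (1 = ink, 0 = background) and, for Devanagari, that the header line (shirorekha) joining the characters of a word has already been removed. The names `profile`, `runs` and `segment` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def profile(img: Image, axis: str) -> List[int]:
    """Count ink pixels per row ("rows") or per column ("cols")."""
    if axis == "rows":
        return [sum(row) for row in img]
    return [sum(col) for col in zip(*img)]


def runs(hist: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges where the histogram is non-zero."""
    out: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(hist):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(hist)))
    return out


def segment(img: Image) -> List[List[Tuple[int, int]]]:
    """For each text line, return the column ranges of its ink blocks."""
    result = []
    for top, bottom in runs(profile(img, "rows")):
        line = img[top:bottom]
        result.append(runs(profile(line, "cols")))
    return result


page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
]
print(segment(page))  # [[(0, 2), (3, 4), (5, 7)], [(1, 4)]]
```

Empty rows separate the two text lines, and empty columns inside each line separate the blocks. A real page would need a noise-tolerant threshold rather than an exact zero test.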

Figure: Horizontal and vertical histograms

Figure: Example of segmentation

Challenges with Devanagari
- Robust OCR systems are already in place for Roman scripts
- Such software is lacking for Hindi because of the script's highly complicated structure and composition
- An example of such composition is a form of character conjunction referred to as a Jutakshar

Figure: Classes of Jutakshars
*Note: the model was developed to handle Jutakshars similar in type to the first four classes.

Our Work
- Words containing Jutakshars are quite common and can be found in many documents
- Most OCR systems do not handle such characters, which in turn reduces their precision
- Accuracy is low because a Jutakshar is recognized as a single character
- We have tried to build upon existing frameworks to better handle these special strings

Figure: Where segmentation fails

Approach
- Jutakshar Detection: figure out whether a character is a jutakshar or not
- Jutakshar Identification: figure out the jutakshar and in turn the entire word

Jutakshar Detection
- Created a dataset of jutakshars using 8 fonts and added Gaussian noise
- Trained classifiers to predict whether a given character was a jutakshar or not
- Experimented with three different types of models
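The noise augmentation might look roughly like the sketch below. The slides do not give the noise level or the image format, so the 8-bit grayscale nested-list image, the `sigma` value and the helper name `add_gaussian_noise` are all assumptions made for illustration.

```python
import random
from typing import List

Gray = List[List[int]]  # 8-bit grayscale image, values 0..255


def add_gaussian_noise(img: Gray, sigma: float, rng: random.Random) -> Gray:
    """Add zero-mean Gaussian noise to every pixel and clip to 0..255."""
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        for row in img
    ]


rng = random.Random(0)  # fixed seed so the augmented set is reproducible
glyph = [[0, 255, 0], [255, 255, 255], [0, 255, 0]]  # stand-in for a rendered jutakshar

# Several noisy copies of one rendered glyph; sigma=20.0 is an arbitrary choice.
noisy_variants = [add_gaussian_noise(glyph, sigma=20.0, rng=rng) for _ in range(4)]
```

Each font rendering would be passed through such a function several times to enlarge the training set.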

Jutakshar Detection
- Model I: SVM trained using HOG (histogram of oriented gradients) features
- Model II: Logistic regression using features extracted from the penultimate layer of a pretrained convolutional neural network
- Model III: Classification using the width of the character boxes
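Model III is the simplest of the three to sketch. The slides only say that box width is used, so the rule below (flag a box when it is noticeably wider than the median box of the same line) and the `ratio` value are assumptions, not the authors' decision rule.

```python
from statistics import median
from typing import List


def flag_wide_boxes(widths: List[int], ratio: float = 1.3) -> List[bool]:
    """Mark boxes wider than `ratio` times the median width as jutakshar candidates."""
    typical = median(widths)
    return [w > ratio * typical for w in widths]


# Widths (in pixels) of the character boxes in one hypothetical line.
widths = [20, 22, 21, 34, 19]
print(flag_wide_boxes(widths))  # [False, False, False, True, False]
```

The intuition is that a half character fused with a full character tends to occupy a wider box than a single character.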

Figure: Comparison of how different models work for Jutakshar Detection

Jutakshar Identification
- Using the previous model, the character boxes predicted as containing jutakshars are split into the corresponding half and full characters
- Vertical histograms are used for splitting, with the search restricted to the middle part of the character
- The position of the minimum is used to split the Jutakshar into the half character and the full character forming it
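The splitting step can be sketched as follows. The fraction of the box treated as the "middle part" is not stated in the slides, so the 25% to 75% window, the binary nested-list image and the names `split_column` and `split_box` are assumptions.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def split_column(img: Image, lo_frac: float = 0.25, hi_frac: float = 0.75) -> int:
    """Return the column with the least ink inside the middle window of the box."""
    width = len(img[0])
    hist = [sum(col) for col in zip(*img)]  # vertical histogram
    lo = int(width * lo_frac)
    hi = max(lo + 1, int(width * hi_frac))
    return min(range(lo, hi), key=lambda c: hist[c])


def split_box(img: Image) -> Tuple[Image, Image]:
    """Cut the box at the chosen column into a left (half) and right (full) part."""
    c = split_column(img)
    return [row[:c] for row in img], [row[c:] for row in img]


box = [
    [1, 1, 1, 1, 1, 1, 1, 1],  # header line spanning the whole box
    [1, 1, 0, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 0, 0, 1, 1],
]
left, right = split_box(box)
print(split_column(box))  # 4
```

Restricting the search window keeps the cut away from the thin strokes at the box edges, which would otherwise produce spurious minima.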

Figure: Example showing how splitting works on Jutakshars

Jutakshar Identification
- The segmented word is passed on to the character classifiers to predict the respective characters
- The predicted word is passed to a dictionary, which gives a list of suggested correct words for an incorrect word
- A heuristic is defined to find the correct prediction from the suggested list

Jutakshar Identification
- The suggested list is refined to keep only words that contain Jutakshars and whose length is as close as possible to that of the predicted word
- Levenshtein distance between the predicted and the suggested word (the minimum number of single-character insertions, deletions or substitutions required to convert one string into the other)
- Number of substrings that match between the predicted word with the jutakshars removed and the suggested word
- Length of the longest common substring between the predicted word with the jutakshars removed and the suggested word in the list
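Two of the criteria above, Levenshtein distance and longest common substring, can be sketched in a few lines. The slides do not say how the criteria are combined, so the ordering used in `rank` (smaller edit distance first, longer common substring as tie-breaker) is an assumption, and the Latin-letter words are placeholders standing in for Devanagari strings.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous run shared by both strings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def rank(predicted: str, suggestions: List[str]) -> List[str]:
    """Order suggestions by edit distance, then by longest common substring."""
    return sorted(
        suggestions,
        key=lambda s: (levenshtein(predicted, s),
                       -longest_common_substring(predicted, s)),
    )


print(rank("vidalay", ["vidya", "visalam", "vidyalay"]))
# ['vidyalay', 'visalam', 'vidya']
```

For real Devanagari text the comparison units matter: Python strings are sequences of code points, and a conjunct is encoded as several code points, so a production version would probably compare grapheme clusters instead.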

Figure: Example of suggestions returned by the dictionary

Figure: Correction in the predicted word after applying our heuristic

Results for Jutakshar Detection Model
Figure: Plot of accuracies of the Jutakshar detection model on different fonts and documents

Results: Jutakshar Detection
Table: Jutakshar Detection (Font 1)
Document   Total # Words   # Words containing Jutakshars   # True Positives   # False Positives
Doc 1      166             6                               6                  0
Doc 2      359             20                              20                 0
Doc 3      131             5                               4                  1
Doc 4      138             9                               9                  0

Results: Jutakshar Detection
Table: Jutakshar Detection (Font 2)
Document   Total # Words   # Words containing Jutakshars   # True Positives   # False Positives
Doc 1      145             6                               6                  0
Doc 2      218             4                               4                  0
Doc 3      138             4                               4                  0
Doc 4      167             7                               7                  2

Results: Jutakshar Detection
Table: Jutakshar Detection (Font 3)
Document   Total # Words   # Words containing Jutakshars   # True Positives   # False Positives
Doc 1      98              5                               5                  1
Doc 2      171             10                              10                 1
Doc 3      243             10                              9                  0
Doc 4      210             7                               7                  0

Results: Jutakshar Detection
Table: Jutakshar Detection (Font 4)
Document   Total # Words   # Words containing Jutakshars   # True Positives   # False Positives
Doc 1      72              3                               3                  1
Doc 2      99              3                               3                  1
Doc 3      114             8                               8                  1
Doc 4      133             5                               5                  0

Results: Jutakshar Detection
Table: Jutakshar Detection (Font 5)
Document   Total # Words   # Words containing Jutakshars   # True Positives   # False Positives
Doc 1      246             10                              10                 3
Doc 2      110             3                               3                  1
Doc 3      89              2                               2                  2
Doc 4      129             9                               9                  2
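As a worked reading of the Font 5 table, the counts can be aggregated into recall and precision. The slides do not report these two figures, so this is only one plausible interpretation: it assumes recall is true positives over words containing jutakshars, and precision is true positives over all flagged words.

```python
# (document, words containing jutakshars, true positives, false positives), Font 5
font5 = [
    ("Doc 1", 10, 10, 3),
    ("Doc 2", 3, 3, 1),
    ("Doc 3", 2, 2, 2),
    ("Doc 4", 9, 9, 2),
]

jutakshar_words = sum(row[1] for row in font5)
true_pos = sum(row[2] for row in font5)
false_pos = sum(row[3] for row in font5)

recall = true_pos / jutakshar_words            # share of jutakshar words that were found
precision = true_pos / (true_pos + false_pos)  # share of flagged words that were correct

print(f"recall={recall:.2f} precision={precision:.2f}")  # recall=1.00 precision=0.75
```

Under this reading, every jutakshar word in the Font 5 documents is detected, and the errors come entirely from false positives.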

Results: Jutakshar Detection
Types of false-positive cases

OCR Evaluation

Error rates
- WER: number of erroneous words divided by the total word count in the ground-truth text
- CER: number of erroneous characters divided by the total character count in the ground-truth text
- Evaluation tool: https://github.com/impactcentre/ocrevaluation

Mathematically,

\[ \mathrm{WER}(\%) = \frac{S_w + I_w + D_w}{N_w} \times 100 \]

\[ \mathrm{CER}(\%) = \frac{S_c + I_c + D_c}{N_c} \times 100 \]

where S, I and D are the numbers of substituted, inserted and deleted words (subscript w) or characters (subscript c), and N is the corresponding count in the ground-truth text.
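The two error rates can be computed with an edit distance over words or characters, as in the sketch below. It is an illustration of the formulas only and is not the implementation used by the ocrevaluation tool; whitespace tokenization and per-code-point characters are assumptions.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Minimum substitutions + insertions + deletions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def wer(ref_text: str, hyp_text: str) -> float:
    """Word error rate in percent, relative to the ground-truth word count."""
    ref = ref_text.split()
    return 100.0 * edit_distance(ref, hyp_text.split()) / len(ref)


def cer(ref_text: str, hyp_text: str) -> float:
    """Character error rate in percent, relative to the ground-truth length."""
    return 100.0 * edit_distance(ref_text, hyp_text) / len(ref_text)


ground_truth = "the quick brown fox"
ocr_output = "the quick brwn fox"
print(wer(ground_truth, ocr_output))  # 25.0
print(round(cer(ground_truth, ocr_output), 2))  # 5.26
```

One wrong word out of four gives a WER of 25%, while the single missing character out of 19 gives a much smaller CER. The same gap between CER and WER appears throughout the accuracy tables.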

Results: Jutakshar Identification
Figure: Plot of the character error rates with which words containing jutakshars are identified

Overall Accuracies
Table: OCR Accuracy: Comparing Models (Font 1)
Document   CER(%) (Prev Model)   WER(%) (Prev Model)   CER(%) (Our Model)   WER(%) (Our Model)
Doc 1      6.8                   16.5                  6.7                  8.6
Doc 2      8.2                   19.8                  6.3                  8.5
Doc 3      8.1                   19.0                  7.6                  8.8
Doc 4      9.8                   22.7                  9.5                  12.1
Overall    8.2                   19.6                  7.5                  9.5

Overall Accuracies
Table: OCR Accuracy: Comparing Models (Font 2)
Document   CER(%) (Prev Model)   WER(%) (Prev Model)   CER(%) (Our Model)   WER(%) (Our Model)
Doc 1      13.5                  29.1                  17.6                 25.4
Doc 2      8.9                   20.3                  9.9                  13.0
Doc 3      11.7                  31.2                  15.7                 19.6
Doc 4      6.9                   15.9                  6.9                  11.8
Overall    9.4                   23.6                  12.5                 17.4

Overall Accuracies
Table: OCR Accuracy: Comparing Models (Font 3)
Document   CER(%) (Prev Model)   WER(%) (Prev Model)   CER(%) (Our Model)   WER(%) (Our Model)
Doc 1      9.3                   20.8                  10.0                 6.9
Doc 2      10.4                  22.1                  11.8                 16.8
Doc 3      8.2                   19.7                  8.7                  12.8
Doc 4      9.8                   23.3                  10.2                 11.8
Overall    9.3                   21.5                  7.8                  12.1

Overall Accuracies
Table: OCR Accuracy: Comparing Models (Font 4)
Document   CER(%) (Prev Model)   WER(%) (Prev Model)   CER(%) (Our Model)   WER(%) (Our Model)
Doc 1      8.8                   18.4                  13.6                 17.6
Doc 2      9.6                   18.7                  10.1                 14.3
Doc 3      6.3                   16.6                  7.6                  13.3
Doc 4      7.1                   16.9                  9.3                  13.1
Overall    7.8                   17.5                  10.1                 14.6

Overall Accuracies
Table: OCR Accuracy: Comparing Models (Font 5)
Document   CER(%) (Prev Model)   WER(%) (Prev Model)   CER(%) (Our Model)   WER(%) (Our Model)
Doc 1      6.3                   16.0                  5.6                  8.1
Doc 2      4.5                   9.8                   5.5                  7.5
Doc 3      6.7                   17.9                  7.8                  12.2
Doc 4      8.3                   17.3                  8.1                  7.9
Overall    6.5                   15.4                  6.7                  8.9

Future Work
- Our model relies heavily on the dictionary, so a more comprehensive dictionary would remove quite a few errors
- The dataset can be made more representative by adding more fonts and more of the possible jutakshars
- Two deep network models could be trained: one for all half characters and the other for all the different types of full characters

Questions?