AI/ML Use Case Study
Problem Statement/Customer Requirements:
One of our clients from the U.S. requested an AI/ML-based model to analyze their financial data and
an AI application that generates reports from their raw data (PDF and JPG files). They wanted us to
extract the data in JSON format and upload it to their servers, using a CNN architecture.



Challenges for the Client:


• Text detection in documents
• Form and table extraction and processing
• Information extraction from identity documents
• Information extraction from invoices and receipts
• Multi-column detection and reading order
• Natural language processing and document classification
• Natural language processing for medical documents
• Document translation
• Search and discovery
• Compliance control with document redaction
• PDF and multi-page TIFF document processing


Solution Introduction:


In this document, we present our step-by-step process of building a Banking Statement PDF
Extractor, an OCR system that efficiently extracts specific information from our U.S. customer's
bank statement PDFs and images. The main challenge was the variation in policies and statement
formats across banks. Our journey included understanding the data types, preparing the data with
NLP techniques, using a Bayesian model for OCR, manually annotating scanned images for
troubleshooting, and finally applying machine learning, deep learning, and transfer learning to
improve word-extraction accuracy.


Understanding various formats:


We started by collecting bank statement data from our clients in PDF and image formats. Our first
step was to thoroughly study and analyze the documents to understand the statement format used by
each bank.


Data correction using NLP techniques:


To improve OCR accuracy, we used natural language processing (NLP) techniques for data
cleansing. These methods allowed us to preprocess and normalize the text, making it clearer and
reducing noise.
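As an illustration, a minimal cleansing pass might look like the following sketch. The specific substitution rules are hypothetical examples of common OCR confusions (letter O for zero, lowercase l for one), not our full pipeline:

```python
import re
import unicodedata

def normalize_ocr_text(text: str) -> str:
    """Clean and normalize raw OCR output before field extraction."""
    # Fold Unicode ligatures and odd code points into plain equivalents
    text = unicodedata.normalize("NFKC", text)
    # Illustrative fixes for OCR confusions inside numbers
    text = re.sub(r"(?<=\d)[Oo](?=\d)", "0", text)  # O between digits -> 0
    text = re.sub(r"(?<=\d)[lI](?=\d)", "1", text)  # l/I between digits -> 1
    # Collapse runs of spaces/tabs (but keep newlines) and trim blank runs
    text = re.sub(r"[^\S\n]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Rules like these are cheap to apply per bank format and noticeably reduce downstream extraction noise.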


First OCR attempt:


Our first OCR attempts, using the Tesseract and PyPDF Python libraries, did not produce
satisfactory results, prompting us to explore more complex and advanced methods.
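A stripped-down version of that first attempt could look like the sketch below. It assumes the `pypdf`, `Pillow`, and `pytesseract` packages (plus a local Tesseract install) are available; the text-layer heuristic and its threshold are our own illustrative choices:

```python
def has_text_layer(page_text: str, min_chars: int = 20) -> bool:
    """Heuristic: treat a page as born-digital if extraction yields real text."""
    return len(page_text.strip()) >= min_chars

def extract_pdf_text(path: str) -> list[str]:
    """Pull the embedded text layer from each page of a PDF."""
    from pypdf import PdfReader  # assumes pypdf is installed
    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]

def ocr_image(path: str) -> str:
    """Run Tesseract OCR on a scanned statement image (JPG/PNG)."""
    from PIL import Image       # assumes Pillow is installed
    import pytesseract          # assumes pytesseract + Tesseract binary
    return pytesseract.image_to_string(Image.open(path))
```

Pages with a usable text layer can skip OCR entirely; only image-based pages fall through to Tesseract.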


Bayesian model for OCR:


To increase the accuracy of the OCR, we used a Bayesian model. This approach provided a
significant improvement over our earlier efforts and demonstrated the potential of NLP in OCR.
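One simple way to apply Bayes' rule to noisy OCR output is a noisy-channel corrector: for an unrecognized token, pick the vocabulary word w that maximizes the prior P(w) among candidates within one edit. The sketch below uses a hypothetical domain vocabulary and uniform edit likelihoods; our production model was more elaborate:

```python
from collections import Counter

# Hypothetical banking vocabulary with frequencies from a cleaned corpus
VOCAB = Counter({"balance": 120, "statement": 100, "deposit": 80,
                 "withdrawal": 60, "interest": 40})
TOTAL = sum(VOCAB.values())

def edits1(word: str) -> set[str]:
    """All strings one edit (delete/transpose/replace/insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word: str) -> str:
    """Noisy-channel correction: argmax over the prior P(w) of candidates."""
    if word in VOCAB:
        return word                  # already a known word
    candidates = edits1(word) & VOCAB.keys()
    if not candidates:
        return word                  # nothing close enough; keep as-is
    return max(candidates, key=lambda w: VOCAB[w] / TOTAL)
```

For example, an OCR token like "balanse" maps to "balance", while genuinely unknown tokens pass through unchanged.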


The challenges of scanning handwritten images:


Despite the success of the Bayesian model, OCR accuracy on scanned handwritten images remained
poor, posing a unique challenge.


Adopting Deep Learning and Transfer Learning:


We explored deep learning to overcome the limitations posed by scanned handwritten images.


Benefits of transfer learning:


Transfer learning emerged as a viable solution. We adopted the pre-trained VGGNet model,
trained on a large image dataset, as the backbone of our OCR algorithm. Even with limited
training data of our own, this produced remarkable results.
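A sketch of this setup in PyTorch, assuming `torchvision` is installed. The frozen-backbone strategy and the replaced head are illustrative; the label-set size (`num_classes`) is a project-specific choice, not taken from the original system:

```python
def build_vgg_ocr_backbone(num_classes: int):
    """Transfer-learning sketch: pretrained VGG16 features + a new head."""
    import torch.nn as nn
    from torchvision import models  # assumes torchvision is installed

    # Load VGG16 with ImageNet weights and freeze the convolutional backbone
    vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
    for param in vgg.features.parameters():
        param.requires_grad = False

    # Replace the final classifier layer with one sized for our label set,
    # so only the new head (and unfrozen classifier layers) get trained
    vgg.classifier[6] = nn.Linear(vgg.classifier[6].in_features, num_classes)
    return vgg
```

Freezing the pretrained features is what lets a small annotated set of handwritten scans suffice: only the new head has to be learned from scratch.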


Achieving accuracy:


Using transfer learning allowed us to achieve high accuracy across bank documents, including
scanned handwritten images. Our system was now consistently extracting the important
information from all document types.


Building an API and future prospects:


To make the results available to the client, we developed an easy-to-use API that returns the
extracted data in a customized format. We also briefed the client on AI applications that could
be deployed on cloud services such as AWS and Azure for increased accessibility and
performance.
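For illustration, the JSON payload returned by such an API might be assembled like this. The field names are hypothetical; the real schema was agreed with the client:

```python
import json
from datetime import date

def build_extraction_payload(source_file: str, transactions: list[dict]) -> str:
    """Package extracted statement fields as the JSON body the API returns.

    Field names here are illustrative placeholders, not the client schema.
    """
    payload = {
        "source_file": source_file,
        "extracted_on": date.today().isoformat(),
        "transaction_count": len(transactions),
        "transactions": transactions,
    }
    return json.dumps(payload, indent=2)
```

Keeping the payload a plain JSON document made it straightforward to upload to the client's servers, per the original requirement.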




Conclusion:


Our Banking Statement PDF Extractor OCR system went through a journey of research, testing,
and optimization. Through a combination of NLP methods, Bayesian modeling, and transfer
learning, we successfully developed a robust and accurate solution to meet our client's data
extraction needs. This effort exemplifies our commitment to embracing cutting-edge technology
and providing innovative solutions to our valued clients.