AI/ML Use Case Study
Problem Statement/Customer Requirements:
One of our clients in the U.S. requested an AI/ML-based model to analyze their financial data,
along with an AI app that generates reports from their raw data (PDF and JPG files). They
wanted us to extract the data in JSON format and upload it to their servers, using a model built
on a CNN architecture.
• Text detection from documents
• Form and table extraction and processing
• Extract information from identity documents
• Extract information from invoices and receipts
• Multi-column detection and reading order
• Natural language processing and document classification
• Natural language processing for medical documents
• Document translation
• Search and discovery
• Compliance control with document redaction
• PDF and multi-page TIFF document processing
In this document, we present our step-by-step process for building a Banking Statement PDF
Extractor OCR system that efficiently extracts specific information from our U.S. customer’s bank
statement PDFs and images. The main challenge lay in the differing policies and information
provided by each bank. Our journey included understanding the data types, preparing data using
NLP techniques, using a Bayesian model for OCR, manually annotating scanned images to
troubleshoot, and finally applying machine learning, deep learning and transfer learning to
improve word extraction accuracy.
We started by collecting bank statement data from our client in PDF and image formats. Our first
step was to thoroughly study and analyze the documents to understand the statement format used
by each bank.
To improve OCR accuracy, we used natural language processing (NLP) techniques for data
cleansing. These methods allowed us to preprocess and normalize the text, making it clearer and
reducing noise.
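As a minimal sketch of this kind of cleansing step, the snippet below normalizes raw extracted text using only the Python standard library. The function name normalize_text and the specific rules (Unicode normalization, rejoining hyphenated line breaks, removing control characters, collapsing whitespace) are illustrative assumptions, not the exact rules used in the project.

```python
import re
import unicodedata

def normalize_text(raw):
    """Clean raw extracted text (hypothetical helper, illustrative rules)."""
    # Fold compatibility characters (full-width letters, non-breaking spaces).
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop control characters such as form feeds, keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim each line and discard empty lines.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Rules like these are cheap to apply and keep later steps from having to handle layout debris themselves.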
Our first OCR attempts, using Tesseract and the PyPDF Python library, did not produce
satisfactory results, prompting us to explore more advanced methods.
To increase the accuracy of the OCR, we used a Bayesian model. This approach provided a
significant improvement over our earlier efforts and demonstrated the potential of NLP in OCR.
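The source does not specify the form of the Bayesian model. One common way to apply Bayes' rule to OCR output is a noisy-channel corrector, which picks the vocabulary word that maximizes P(word) × P(observed | word). The sketch below illustrates that idea only. The vocabulary counts, the confusable character pairs and the per-character probabilities are all made-up assumptions.

```python
from collections import Counter

# Hypothetical word frequencies; in practice these would be counted
# from clean statement text.
WORD_COUNTS = Counter({"balance": 50, "deposit": 40, "withdrawal": 30,
                       "transfer": 25, "payment": 45})
TOTAL = sum(WORD_COUNTS.values())

# (observed, intended) character pairs assumed to be easily confused by OCR.
CONFUSABLE = {("1", "l"), ("0", "o"), ("5", "s"), ("8", "b"),
              ("i", "l"), ("l", "i")}

def likelihood(observed, word):
    """P(observed | word) under a simple per-character error model."""
    if len(observed) != len(word):
        return 0.0  # this sketch only models character substitutions
    p = 1.0
    for o, w in zip(observed, word):
        if o == w:
            p *= 0.9      # character read correctly (assumed rate)
        elif (o, w) in CONFUSABLE:
            p *= 0.08     # plausible OCR confusion (assumed rate)
        else:
            p *= 0.002    # unlikely substitution (assumed rate)
    return p

def correct(observed):
    """Return the vocabulary word with the highest posterior score.

    The token is lowercased first; if no word can explain it, the
    lowercased token is returned unchanged.
    """
    observed = observed.lower()
    scores = {w: (c / TOTAL) * likelihood(observed, w)
              for w, c in WORD_COUNTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else observed
```

For example, correct("ba1ance") returns "balance", because the prior for "balance" combined with a single plausible 1/l confusion outweighs every other candidate.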
Despite the success of the Bayesian model, OCR accuracy remained a challenge on scanned
handwritten images.
We explored deep learning to overcome the limitations we faced with scanned handwritten images.
Transfer learning emerged as a viable solution. We adopted a pre-trained VGGNet model,
trained on a large dataset, as the backbone of our OCR algorithm. Even with limited training
data, this produced remarkable results.
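The core idea of transfer learning, a frozen feature extractor with a small trainable head, can be sketched in dependency-free Python. This is a toy illustration and not the project's model: the two hand-written 3x3 stroke filters merely stand in for pre-trained VGG convolutional layers, and all names (FROZEN_FILTERS, train_head, predict) are hypothetical. A real system would load the pre-trained network from a deep learning framework.

```python
import math

# Hand-written 3x3 stroke detectors standing in for pre-trained
# convolutional filters (an assumption made purely for illustration).
VERTICAL_FILTER = [-0.5, 1, -0.5, -0.5, 1, -0.5, -0.5, 1, -0.5]
HORIZONTAL_FILTER = [-0.5, -0.5, -0.5, 1, 1, 1, -0.5, -0.5, -0.5]
FROZEN_FILTERS = [VERTICAL_FILTER, HORIZONTAL_FILTER]

def frozen_features(image):
    """Apply the fixed filters with ReLU; these weights are never updated."""
    return [max(0.0, sum(w * px for w, px in zip(f, image)))
            for f in FROZEN_FILTERS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(images, labels, epochs=100, lr=0.1):
    """Train only a logistic-regression head on top of the frozen features."""
    feats = [frozen_features(img) for img in images]
    weights, bias = [0.0] * len(FROZEN_FILTERS), 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, f)) + bias) - y
            weights = [w - lr * err * v for w, v in zip(weights, f)]
            bias -= lr * err
    return weights, bias

def predict(head, image):
    """Probability that the 3x3 image contains a vertical stroke."""
    weights, bias = head
    feats = frozen_features(image)
    return sigmoid(sum(w * v for w, v in zip(weights, feats)) + bias)

# Two labelled examples are enough for the head, because the frozen
# filters already separate the two stroke types.
VERTICAL_BAR = [0, 1, 0, 0, 1, 0, 0, 1, 0]
HORIZONTAL_BAR = [0, 0, 0, 1, 1, 1, 0, 0, 0]
head = train_head([VERTICAL_BAR, HORIZONTAL_BAR], [1, 0])
```

Because the filters are reused rather than learned, the head needs very few labelled samples, which is the property that makes transfer learning attractive when annotated data is scarce.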
Transfer learning allowed us to achieve consistent accuracy across bank documents, including
scanned handwritten images. Our system was now consistently extracting the important
information from all types of documents.
To expose our results to the client, we developed an easy-to-use API that presents the
extracted data in a customized way. We also briefed the client on the possibility of deploying
the AI application on cloud services such as AWS and Azure for increased accessibility and
performance.
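To make the JSON output requested by the client more concrete, the sketch below turns recognized statement lines into a JSON payload. The line layout (date, description, signed amount, running balance), the field names and the helpers parse_line and to_json are assumptions for illustration. The actual schema and per-bank layouts are not given in this document.

```python
import json
import re
from dataclasses import dataclass, asdict

# Hypothetical line layout: MM/DD/YYYY, description, signed amount, balance.
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4})\s+(.+?)\s+(-?[\d,]+\.\d{2})\s+(-?[\d,]+\.\d{2})$"
)

@dataclass
class Transaction:
    date: str
    description: str
    amount: str   # kept as a string to avoid float rounding of currency
    balance: str

def parse_line(line):
    """Parse one statement line; return None if it is not a transaction."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    date, description, amount, balance = match.groups()
    return Transaction(date, description,
                       amount.replace(",", ""), balance.replace(",", ""))

def to_json(bank, lines):
    """Build the JSON document for one statement (illustrative schema)."""
    transactions = [t for t in map(parse_line, lines) if t is not None]
    payload = {"bank": bank,
               "transactions": [asdict(t) for t in transactions]}
    return json.dumps(payload, indent=2)
```

Lines that do not match the pattern, such as page headers, are skipped rather than raising an error, so one malformed line does not block the rest of the statement.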
Our Banking Statement PDF Extractor OCR system went through a journey of research, testing
and optimization. Through a combination of NLP methods, Bayesian modeling and transfer
learning, we successfully developed a robust and accurate solution to meet our client’s data
extraction needs. This effort exemplifies our commitment to embracing cutting-edge technology
and providing innovative solutions to our valued clients.