ML/AI

Document Classification API

NLP microservice classifying PDFs and extracting key entities.

Client

Internal Project

Industry

NLP / Document Processing

Duration

2 months (2024)

Role

ML Engineer

Overview

Built an NLP microservice that automatically classifies documents and extracts structured data from unstructured PDFs.

Key Results

95% classification accuracy
20+ document types
2 sec processing time per document
10K+ documents processed

The Challenge

Manual document processing was a bottleneck:

Hours spent manually categorizing documents
Inconsistent classification across team members
Key data buried in unstructured text
No searchable document database
Compliance risks from misfiled documents

The team needed automated document understanding.

The Solution

I developed an intelligent document processing system:

Text Extraction

PDF parsing for digital documents and OCR for scanned ones, so text can be extracted from either kind.
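
A minimal sketch of how this step could look with PyMuPDF and Tesseract, both listed in the tech stack. The fallback rule (run OCR when a page has almost no embedded text), the `MIN_CHARS` threshold, and the function names are assumptions for illustration, not the project's actual code. It also assumes the `pytesseract` and `Pillow` packages as the Python bridge to Tesseract.

```python
import io

MIN_CHARS = 20  # assumed threshold: fewer characters suggests a scanned page


def needs_ocr(embedded_text: str, min_chars: int = MIN_CHARS) -> bool:
    """Return True when a page has too little embedded text to rely on."""
    return len(embedded_text.strip()) < min_chars


def extract_text(pdf_path: str) -> str:
    """Extract text page by page, falling back to OCR for scanned pages."""
    # Third-party imports are local so needs_ocr works without them installed.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if needs_ocr(text):
                # Render the page to an image and run Tesseract on it.
                pix = page.get_pixmap(dpi=300)
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                text = pytesseract.image_to_string(image)
            pages.append(text)
    return "\n".join(pages)
```

Checking each page separately means a PDF that mixes digital and scanned pages only pays the OCR cost where it is needed.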

Classification Model

Fine-tuned transformer model for document type classification.
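
A sketch of how inference with a fine-tuned checkpoint might be wired up through the Hugging Face `pipeline` helper. The `model_dir` path, the confidence threshold, and the `needs_review` fallback label are hypothetical. The source does not say whether low-confidence predictions were routed to a person.

```python
REVIEW_LABEL = "needs_review"  # hypothetical fallback label, not from the project


def accept_prediction(prediction: dict, threshold: float = 0.8) -> str:
    """Return the predicted label, or REVIEW_LABEL when confidence is low."""
    if prediction["score"] >= threshold:
        return prediction["label"]
    return REVIEW_LABEL


def classify(text: str, model_dir: str, threshold: float = 0.8) -> str:
    """Classify one document's text with a fine-tuned checkpoint in model_dir."""
    # Imported here so accept_prediction can be used without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_dir)
    # truncation=True clips long documents to the model's maximum input length.
    prediction = classifier(text, truncation=True)[0]
    return accept_prediction(prediction, threshold)
```

In a running service the pipeline would normally be built once at startup rather than on every call. It is built inside `classify` here only to keep the sketch short.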

Entity Extraction

spaCy NER for extracting names, dates, amounts, and custom entities.
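
One way this could be set up in spaCy is sketched below: a pretrained pipeline supplies the standard entities, and an `entity_ruler` placed before the NER component adds custom ones. The model name `en_core_web_sm`, the `INVOICE_NO` label, and its pattern are illustrative assumptions. The project's actual custom entities are not listed in the source.

```python
from collections import defaultdict


def group_entities(pairs):
    """Group (text, label) pairs by label, dropping repeated mentions."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


def build_pipeline(model_name: str = "en_core_web_sm"):
    """Load a pretrained spaCy pipeline and add a rule for a custom entity."""
    import spacy  # local import so group_entities works without spaCy

    nlp = spacy.load(model_name)
    ruler = nlp.add_pipe("entity_ruler", before="ner")
    # Hypothetical custom entity: invoice numbers such as "INV-1234".
    ruler.add_patterns(
        [{"label": "INVOICE_NO", "pattern": [{"TEXT": {"REGEX": r"^INV-\d+$"}}]}]
    )
    return nlp


def extract_entities(nlp, text: str) -> dict:
    """Run the pipeline and return entity texts grouped by label."""
    doc = nlp(text)
    return group_entities((ent.text, ent.label_) for ent in doc.ents)
```

Grouping by label gives a structure that is easy to return as JSON from an API response.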

REST API

FastAPI endpoint for document upload and processing.
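
A minimal sketch of what an upload endpoint could look like in FastAPI. The `/documents` route, the response fields, and the PDF-only check are assumptions for illustration, and the handler stops short of calling the extraction and classification steps. File uploads in FastAPI also require the `python-multipart` package.

```python
ALLOWED_TYPES = {"application/pdf"}


def is_pdf_upload(filename: str, content_type: str) -> bool:
    """Accept only uploads that look like PDFs by MIME type and extension."""
    return content_type in ALLOWED_TYPES and filename.lower().endswith(".pdf")


def create_app():
    """Build the FastAPI app; imports are local so the helper stays standalone."""
    from fastapi import FastAPI, File, HTTPException, UploadFile

    app = FastAPI(title="Document Classification API")

    @app.post("/documents")  # hypothetical route name
    async def process_document(file: UploadFile = File(...)):
        if not is_pdf_upload(file.filename or "", file.content_type or ""):
            raise HTTPException(status_code=415, detail="Only PDF uploads are supported")
        payload = await file.read()
        # A real handler would pass payload on to extraction and classification.
        return {"filename": file.filename, "size_bytes": len(payload)}

    return app
```

With the app exposed as `app = create_app()` in a module, it could be served with an ASGI server such as uvicorn.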

Key Features

📄

PDF Processing

Extract text from scanned and digital PDFs.

🏷️

Classification

Auto-categorize into 20+ document types.

🔍

Entity Extraction

Extract dates, names, amounts, and more.

🔗

API Ready

RESTful API for easy integration.

Tech Stack

NLP

spaCy, Transformers, NLTK

API

FastAPI, Python

Document

PyMuPDF, Tesseract OCR

ML

Hugging Face, PyTorch

[Screenshots: Entity Extraction, API Documentation, Classification Results]

Achievements

95% classification accuracy across 20+ document types
Processing time under 2 seconds per document
Extracted 50+ entity types with high precision
Processed 10,000+ documents in production
