Document Classification API
NLP microservice classifying PDFs and extracting key entities.
Overview
Built an NLP microservice that automatically classifies documents and extracts structured data from unstructured PDFs.
The Challenge
Manual document processing was a bottleneck:
- Hours spent manually categorizing documents
- Inconsistent classification across team members
- Key data buried in unstructured text
- No searchable document database
- Compliance risks from misfiled documents
The team needed automated document understanding.
The Solution
I developed an intelligent document processing system:
Text Extraction
OCR and PDF parsing to extract text from both scanned and digital PDFs.
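One way the two extraction paths might be combined is to try the embedded text layer first and fall back to OCR only when a page yields almost no text. The sketch below is illustrative, not the project's actual code: the `MIN_CHARS` threshold, the `ocr_page` callable, and the choice of pypdf as the parser are all assumptions.

```python
from typing import Callable, Iterable, List

MIN_CHARS = 20  # assumed threshold below which a page is treated as scanned


def extract_text(
    pages: Iterable[object],
    read_text_layer: Callable[[object], str],
    ocr_page: Callable[[object], str],
    min_chars: int = MIN_CHARS,
) -> List[str]:
    """Return one string per page, using OCR only when the text layer is sparse."""
    results = []
    for page in pages:
        text = (read_text_layer(page) or "").strip()
        if len(text) < min_chars:
            # Little or no embedded text: probably a scanned image, so use OCR.
            text = ocr_page(page).strip()
        results.append(text)
    return results


def extract_pdf(path: str, ocr_page: Callable[[object], str]) -> List[str]:
    """Wire the router to pypdf (an assumed parser, imported lazily)."""
    from pypdf import PdfReader

    reader = PdfReader(path)
    return extract_text(reader.pages, lambda p: p.extract_text(), ocr_page)
```

Passing the OCR step in as a callable keeps the routing logic testable without an OCR engine installed.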
Classification Model
Fine-tuned transformer model for document type classification.
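A classifier of this kind typically emits one raw score (logit) per document type. The sketch below shows how those scores might be turned into a label and a confidence value, with low-confidence documents flagged for manual review. The label names and the 0.8 threshold are hypothetical, and the review flag is an illustrative addition, not something the project description confirms.

```python
import math
from typing import Dict, List, Sequence, Union

LABELS = ["invoice", "contract", "receipt"]  # hypothetical document types
REVIEW_THRESHOLD = 0.8  # assumed cut-off for sending a document to manual review


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(
    logits: Sequence[float],
    labels: Sequence[str] = LABELS,
    threshold: float = REVIEW_THRESHOLD,
) -> Dict[str, Union[str, float, bool]]:
    """Pick the most probable label and flag uncertain predictions."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "label": labels[best],
        "confidence": probs[best],
        "needs_review": probs[best] < threshold,
    }
```

A threshold like this trades automation rate against misfiling risk, which is relevant given the compliance concern listed under The Challenge.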
Entity Extraction
spaCy NER for extracting names, dates, amounts, and custom entities.
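As a rough sketch of the extraction step, the function below runs a spaCy pipeline over the text and groups the recognized entities into named fields. The mapping from spaCy labels (`PERSON`, `DATE`, `MONEY`) to field names is an assumption for illustration, and the pipeline is passed in as `nlp` so the grouping logic does not depend on a particular model.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Assumed mapping from spaCy entity labels to output field names.
FIELD_BY_LABEL = {"PERSON": "names", "DATE": "dates", "MONEY": "amounts"}


def extract_entities(text: str, nlp: Callable) -> Dict[str, List[str]]:
    """Run the pipeline and group entity strings by field, dropping duplicates."""
    doc = nlp(text)
    fields = defaultdict(list)
    for ent in doc.ents:
        field = FIELD_BY_LABEL.get(ent.label_)
        if field is None:
            continue  # ignore labels that are not mapped to a field
        if ent.text not in fields[field]:
            fields[field].append(ent.text)
    return dict(fields)
```

With spaCy and a model installed, `nlp` could be the object returned by `spacy.load("en_core_web_sm")`; custom entity types would need additional rules or training on top of that.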
REST API
FastAPI endpoint for document upload and processing.
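The upload endpoint might look something like the sketch below, which chains the three stages behind a single route. The `/documents` path, the response shape, and the `extract`, `classify`, and `find_entities` callables are hypothetical names chosen for illustration. FastAPI is imported lazily inside the factory, and file uploads in FastAPI also require the `python-multipart` package.

```python
from typing import Callable, Dict, List


def process_document(
    data: bytes,
    extract: Callable[[bytes], List[str]],
    classify: Callable[[str], Dict],
    find_entities: Callable[[str], Dict],
) -> Dict:
    """Run extraction, classification, and entity extraction on one upload."""
    pages = extract(data)
    text = "\n".join(pages)
    return {
        "pages": len(pages),
        "classification": classify(text),
        "entities": find_entities(text),
    }


def create_app(extract, classify, find_entities):
    """Build a FastAPI app around the pipeline (hypothetical route layout)."""
    from fastapi import FastAPI, File, HTTPException, UploadFile

    app = FastAPI(title="Document Classification API")

    @app.post("/documents")
    async def upload_document(file: UploadFile = File(...)):
        # Reject anything that is not declared as a PDF by the client.
        if file.content_type != "application/pdf":
            raise HTTPException(status_code=415, detail="Only PDF uploads are supported")
        return process_document(await file.read(), extract, classify, find_entities)

    return app
```

Keeping `process_document` free of framework code means the pipeline can be unit-tested without starting a server.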
Key Features
PDF Processing
Extract text from scanned and digital PDFs.
Classification
Auto-categorize into 20+ document types.
Entity Extraction
Extract dates, names, amounts, and custom entities.
API Ready
RESTful API for easy integration.
Tech Stack
NLP
API
Document
ML
Screenshots
Entity Extraction
API Documentation
Classification Results
Achievements
- 95% classification accuracy across 20+ document types
- Processing time under 2 seconds per document
- Extracted 50+ entity types with high precision
- Processed 10,000+ documents in production