ML/AI

Document Classification API

NLP microservice classifying PDFs and extracting key entities.

Client: Internal Project
Industry: NLP / Document Processing
Duration: 2 months (2024)
Role: ML Engineer

Overview

Built an NLP microservice that automatically classifies documents and extracts structured data from unstructured PDFs.

Key Results

  • 95% classification accuracy
  • 20+ document types
  • 2 sec processing per document
  • 10K+ documents processed

The Challenge

Manual document processing was a bottleneck:

  • Hours spent manually categorizing documents
  • Inconsistent classification across team members
  • Key data buried in unstructured text
  • No searchable document database
  • Compliance risks from misfiled documents

The team needed automated document understanding.

The Solution

I developed a document processing system with four components:

Text Extraction

PDF parsing for digital documents and OCR for scanned ones, so text can be extracted from either kind of PDF.
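
A minimal sketch of how this step could look, assuming PyMuPDF for the text layer and Tesseract (via pytesseract and Pillow) as the OCR fallback. The function names and the MIN_CHARS_PER_PAGE threshold are illustrative assumptions, not the project's actual code.

```python
import io

MIN_CHARS_PER_PAGE = 20  # assumed threshold: below this, treat the page as scanned


def needs_ocr(text: str, min_chars: int = MIN_CHARS_PER_PAGE) -> bool:
    """Return True when a page's text layer is empty or too sparse to trust."""
    return len(text.strip()) < min_chars


def extract_text(pdf_path: str) -> str:
    """Extract text page by page, falling back to OCR for image-only pages."""
    # Heavy dependencies are imported lazily so needs_ocr() works without them.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if needs_ocr(text):
                # Render the page to a PNG in memory and run Tesseract on it.
                pix = page.get_pixmap(dpi=300)
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                text = pytesseract.image_to_string(image)
            pages.append(text.strip())
    return "\n\n".join(pages)
```

Checking each page separately means a PDF that mixes digital and scanned pages only pays the OCR cost where the text layer is missing.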

Classification Model

Fine-tuned transformer model for document type classification.
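
As a rough illustration, inference with a fine-tuned checkpoint could be wrapped as below, assuming the Hugging Face text-classification pipeline. The model_dir argument, the REVIEW_THRESHOLD value, and the needs_review flag are hypothetical additions for the sketch and are not described in the project itself.

```python
from typing import Dict, List

REVIEW_THRESHOLD = 0.80  # assumed cut-off, not a figure from the project


def load_classifier(model_dir: str):
    """Load a fine-tuned sequence-classification checkpoint from disk."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline("text-classification", model=model_dir)


def to_decision(predictions: List[Dict], threshold: float = REVIEW_THRESHOLD) -> Dict:
    """Turn pipeline output ([{'label': ..., 'score': ...}]) into a decision."""
    top = max(predictions, key=lambda p: p["score"])
    return {
        "doc_type": top["label"],
        "confidence": round(top["score"], 4),
        # Low-confidence predictions are flagged rather than silently accepted.
        "needs_review": top["score"] < threshold,
    }


def classify(classifier, text: str) -> Dict:
    """Classify one document's text with an already loaded pipeline."""
    # truncation=True keeps long documents within the model's input limit.
    return to_decision(classifier(text, truncation=True))
```

Keeping to_decision separate from the model call makes the thresholding logic easy to test without loading any weights.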

Entity Extraction

spaCy NER for extracting names, dates, amounts, and custom entities.
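
One way this could be set up, sketched under assumptions: the en_core_web_sm model supplies the built-in labels (names, dates, amounts), and an entity ruler placed before the statistical NER adds a custom label. The PO_NUMBER pattern and the helper names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_entities(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (label, text) pairs by label, dropping duplicates but keeping order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for label, text in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


def build_nlp():
    """Load a spaCy pipeline and register one rule-based custom entity."""
    import spacy  # imported lazily: requires spaCy and a downloaded model

    nlp = spacy.load("en_core_web_sm")  # assumed model choice
    # Rules run before the statistical NER so custom matches take priority.
    ruler = nlp.add_pipe("entity_ruler", before="ner")
    ruler.add_patterns([
        # Hypothetical custom entity: "PO" followed by a number.
        {"label": "PO_NUMBER", "pattern": [{"LOWER": "po"}, {"IS_DIGIT": True}]},
    ])
    return nlp


def extract_entities(nlp, text: str) -> Dict[str, List[str]]:
    """Run the pipeline and return entities grouped by label."""
    doc = nlp(text)
    return group_entities((ent.label_, ent.text) for ent in doc.ents)
```

Grouping by label gives the API a compact, JSON-friendly structure to return alongside the document type.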

REST API

FastAPI endpoint for document upload and processing.
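
The upload endpoint might look roughly like this. The /documents route, the create_app factory, and the process_document callable (bytes in, dict out) are assumptions made for the sketch. File uploads in FastAPI also require the python-multipart package.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF files begin with the '%PDF-' header."""
    return data.startswith(PDF_MAGIC)


def create_app(process_document):
    """Build the app around a hypothetical process_document(bytes) -> dict."""
    from fastapi import FastAPI, File, HTTPException, UploadFile

    app = FastAPI(title="Document Classification API")

    @app.post("/documents")
    async def upload_document(file: UploadFile = File(...)):
        data = await file.read()
        if not looks_like_pdf(data):
            # Reject non-PDF uploads before any expensive processing starts.
            raise HTTPException(status_code=415, detail="Only PDF uploads are supported.")
        # Merge the pipeline's result (e.g. type and entities) into the response.
        return {"filename": file.filename, **process_document(data)}

    return app
```

Passing the processing function into the factory keeps the HTTP layer independent of the models, so the endpoint can be exercised with a stub during testing.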

Key Features

📄

PDF Processing

Extract text from scanned and digital PDFs.

🏷️

Classification

Auto-categorize into 20+ document types.

🔍

Entity Extraction

Extract dates, names, amounts, and more.

🔗

API Ready

RESTful API for easy integration.

Tech Stack

NLP: spaCy, Transformers, NLTK

API: FastAPI, Python

Document: PyMuPDF, Tesseract OCR

ML: Hugging Face, PyTorch

Screenshots

[Screenshot: Entity Extraction (NLP)]

[Screenshot: API Documentation (API)]

[Screenshot: Classification Results (Results)]

Achievements

  • 95% classification accuracy across 20+ document types
  • Processing time under 2 seconds per document
  • Extracted 50+ entity types with high precision
  • Processed 10,000+ documents in production