ASL Recognizer

Problem

The workflow constraint

Existing ASL recognition tools are either research-only or require expensive hardware. A lightweight, real-time system was needed that runs on consumer hardware with just a webcam, using open-source ML frameworks.

Outcome

What changed

Real-time ASL recognition for the 24 static letters (A-Z except J and Z, which require motion)

Published dataset on Kaggle and Hugging Face

v1.0.0 release with live webcam inference

CPU-only inference with sub-100ms latency

Decision log

Decisions and trade-offs

MediaPipe over custom CNN for hand detection

The system needs reliable hand landmark extraction that works across different skin tones and lighting conditions.

Decision: Used Google's MediaPipe Hands for pre-built, production-quality hand landmark detection, then trained a lightweight classifier on extracted landmarks.

Trade-off: Depends on MediaPipe accuracy, but dramatically reduces training data needs and computation compared to end-to-end CNN approaches.

Landmark features over raw pixel classification

Raw image classification would require massive datasets and GPU training.

Decision: Extract 21 hand landmarks with 3 coordinates each (63 features in total) and train a simple dense network on them, keeping the model lightweight and fast enough for real-time inference.

Trade-off: Loses some spatial information from raw images, but enables CPU-only real-time inference with high accuracy on static poses.
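To make the 21 landmarks to 63 features step concrete, here is a minimal sketch of the flattening. It assumes each landmark arrives as an (x, y, z) tuple; the function name and constants are illustrative, not taken from the project code.

```python
from typing import List, Sequence, Tuple

NUM_LANDMARKS = 21        # points per hand, as described above
COORDS_PER_LANDMARK = 3   # x, y, z

# Hypothetical type alias for illustration: one landmark as (x, y, z).
Landmark = Tuple[float, float, float]


def flatten_landmarks(landmarks: Sequence[Landmark]) -> List[float]:
    """Turn 21 (x, y, z) landmarks into one flat 63-value feature vector."""
    if len(landmarks) != NUM_LANDMARKS:
        raise ValueError(
            f"expected {NUM_LANDMARKS} landmarks, got {len(landmarks)}"
        )
    features: List[float] = []
    for x, y, z in landmarks:
        # Keep the x, y, z order of each landmark in the flat vector.
        features.extend((float(x), float(y), float(z)))
    return features
```

A 63-value vector like this is small enough that a dense network can classify it on a CPU, which is the point of the trade-off above.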

Technical teardown

Constraints, architecture, and proof

Inference pipeline: OpenCV webcam capture → MediaPipe hand landmark detection (21 points × 3 coordinates) → feature normalization → TensorFlow dense classifier → real-time prediction overlay.

Training pipeline: raw images → MediaPipe landmark extraction → CSV dataset → model training with validation split.
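The feature normalization step is not spelled out here, so the following is only one plausible sketch, not necessarily the scheme this project uses. It assumes wrist-relative translation (landmark 0 is the wrist in MediaPipe Hands) followed by scaling by the largest absolute coordinate, which makes the features independent of where the hand sits in the frame and how large it appears. The function name is hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical type alias for illustration: one landmark as (x, y, z).
Landmark = Tuple[float, float, float]


def normalize_landmarks(landmarks: Sequence[Landmark]) -> List[float]:
    """Return a flat feature vector that is wrist-relative and scaled to [-1, 1].

    Assumed scheme: subtract the wrist (landmark 0) from every point, then
    divide all values by the largest absolute value.
    """
    if not landmarks:
        raise ValueError("at least one landmark is required")

    wrist_x, wrist_y, wrist_z = landmarks[0]
    relative: List[float] = []
    for x, y, z in landmarks:
        relative.extend((x - wrist_x, y - wrist_y, z - wrist_z))

    scale = max(abs(value) for value in relative)
    if scale == 0.0:
        # Degenerate case: every point coincides with the wrist.
        return relative
    return [value / scale for value in relative]
```

With this scheme, the same hand pose shifted elsewhere in the frame yields the same feature vector up to floating-point error, so the classifier does not have to learn position invariance from data.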

Must run in real-time on consumer hardware with just a webcam

MediaPipe hand landmark extraction must be fast enough for live inference

Model must handle varying lighting conditions and hand positions

J and Z are motion-based letters, so they are excluded from the static classifier's scope

Reliability

Deployment, security, and maintenance

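An adjustable confidence threshold (via the +/- keys) lets the user trade precision against recall at runtime: a higher threshold suppresses uncertain predictions, while a lower one shows more of them. The training data is published on Kaggle and Hugging Face for reproducibility.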
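As a rough sketch of how such a threshold could work, the snippet below gates the classifier's top prediction and nudges the threshold on a key press. The label order, the 0.05 step size, and both function names are assumptions for illustration; in a live loop the key would presumably come from something like OpenCV's `cv2.waitKey`.

```python
from typing import Optional, Sequence, Tuple

# Assumed label order: the 24 static letters, with J and Z left out.
LETTERS = "ABCDEFGHIKLMNOPQRSTUVWXY"


def predict_letter(
    probabilities: Sequence[float], threshold: float
) -> Optional[Tuple[str, float]]:
    """Return (letter, confidence) for the top class, or None if below threshold."""
    if len(probabilities) != len(LETTERS):
        raise ValueError(f"expected {len(LETTERS)} class probabilities")
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    confidence = probabilities[best]
    if confidence < threshold:
        return None  # too uncertain: show nothing rather than a wrong letter
    return LETTERS[best], confidence


def adjust_threshold(threshold: float, key: str, step: float = 0.05) -> float:
    """Raise the threshold on '+', lower it on '-', and clamp to [0, 1]."""
    if key == "+":
        threshold += step
    elif key == "-":
        threshold -= step
    return min(1.0, max(0.0, threshold))
```

Raising the threshold favors precision (fewer but more reliable overlays), while lowering it favors recall.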

What I would improve next

Add temporal model for motion-based J and Z letters

Implement phrase-level recognition with word assembly

Add two-hand gesture support

Build web-based demo with TensorFlow.js