EmotionAI
Real-time facial emotion recognition using convolutional neural networks
THE PROBLEM
Automated emotion recognition has applications in accessibility tools, driver monitoring, and behavioral research. Building it to run in real time, on standard CPU hardware, without a cloud dependency requires deliberate tradeoffs between model accuracy and inference latency.
SYSTEM DESIGN
ENGINEERING DECISIONS
CNN instead of a Vision Transformer
Transformers score higher on emotion recognition benchmarks but cost substantially more compute. For real-time video the forward pass had to finish in under 40 milliseconds or the output lagged visibly. A lightweight CNN backbone hit that target on CPU. A Vision Transformer needed GPU acceleration that was not guaranteed on the target hardware. The accuracy tradeoff was the right call for the deployment constraint.
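One way to check a budget like this is to time the forward pass directly. The sketch below is illustrative only and uses the standard library alone. The names `measure_latency_ms` and `within_budget` are hypothetical helpers, not part of the project, and `forward` stands in for whatever callable wraps the model.

```python
import statistics
import time

BUDGET_MS = 40.0  # per-frame forward pass budget stated in the text


def measure_latency_ms(forward, sample, warmup=5, runs=50):
    """Return a list of per-call latencies in milliseconds for forward(sample)."""
    for _ in range(warmup):
        forward(sample)  # untimed calls so one-off setup cost is excluded
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        forward(sample)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def within_budget(latencies, budget_ms=BUDGET_MS):
    """True when both the median and the slowest call stay under the budget."""
    return (
        statistics.median(latencies) < budget_ms
        and max(latencies) < budget_ms
    )
```

With a PyTorch model, `forward` would typically be a small function that calls the model under `torch.no_grad()` on a preprocessed face crop. Checking the worst case as well as the median matters here because a single slow frame is what shows up as visible lag.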
Frame skipping at inference time
Facial expressions change on human timescales, not every 40 milliseconds. Running inference on every frame at 24fps burned CPU for no perceptible gain in output quality. Processing every third frame, for an effective throughput of 8fps, cut inference load by roughly two thirds. The pipeline holds the last prediction between processed frames so the output stays continuous to the user.
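The skip-and-hold behaviour can be sketched in a few lines. This is a minimal illustration, not the project's code: `skip_frames` and `effective_fps` are hypothetical names, and `classify` is assumed to be any callable that maps a frame to a label.

```python
def effective_fps(source_fps, stride):
    """Inference rate when only every stride-th frame is classified."""
    return source_fps / stride


def skip_frames(frames, classify, stride=3):
    """Yield one label per input frame, calling classify on every stride-th frame.

    Frames in between reuse the most recent label, so the consumer still
    receives an output for every frame.
    """
    last_label = None
    for index, frame in enumerate(frames):
        if index % stride == 0:
            last_label = classify(frame)  # the only expensive call
        yield last_label
```

With a 24fps source and `stride=3`, `classify` runs on frames 0, 3, 6 and so on, which is where the 8fps figure and the roughly two-thirds reduction in inference calls come from.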
Haar cascade for face detection instead of a DNN detector
A DNN face detector would have added a second model with its own latency and accuracy variables to tune alongside the main classifier. For the frontal-face, controlled-lighting conditions this tool targeted, a Haar cascade was accurate enough and kept the pipeline focused on the classification problem, which was the actual research interest.
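For reference, a Haar cascade detection step with OpenCV might look like the sketch below. It assumes the `opencv-python` package and its bundled `haarcascade_frontalface_default.xml` file. The parameter values and the choice to keep only the largest detected face are illustrative assumptions, not settings taken from the project.

```python
def load_frontal_face_cascade():
    """Load OpenCV's bundled frontal-face Haar cascade once, for reuse."""
    import cv2  # imported lazily so largest_face() works without OpenCV

    path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    cascade = cv2.CascadeClassifier(path)
    if cascade.empty():
        raise RuntimeError("could not load cascade file: " + path)
    return cascade


def detect_faces(cascade, frame_bgr, scale_factor=1.1, min_neighbors=5):
    """Return face boxes as (x, y, w, h) tuples for one BGR frame."""
    import cv2

    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(
        gray,
        scaleFactor=scale_factor,    # assumed value, tune per camera
        minNeighbors=min_neighbors,  # assumed value, higher means fewer false hits
        minSize=(48, 48),            # assumed minimum face size in pixels
    )
    return [tuple(int(v) for v in box) for box in boxes]


def largest_face(boxes):
    """Pick the box with the largest area, or None when nothing was found."""
    if not boxes:
        return None
    return max(boxes, key=lambda box: box[2] * box[3])
```

The cascade is loaded once and passed in, since reloading the XML file on every frame would add avoidable latency to the pipeline.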
Softmax output instead of sigmoid
Emotions are mutually exclusive in the way this model frames them. A face expresses one dominant emotion, not several independent ones in parallel. Softmax produces a probability distribution that sums to 1, matching that constraint and making the dominant class legible from the output. Sigmoid treats each class independently and would have obscured the primary signal.
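The difference is easy to see on a small set of raw scores. The sketch below uses plain Python rather than PyTorch so it stands alone. The function names and the label list used in the note after it are made up for illustration and are not the project's class set.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Score each class independently; the results need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def dominant(scores, labels):
    """Return the label with the highest score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best]
```

For logits such as `[2.0, 1.0, 0.1]`, the softmax values sum to 1 and can be read as a share of belief per class, while the three sigmoid values are each above 0.5 and sum to more than 2, so no single class stands out as clearly. Calling `dominant(softmax(logits), ["happy", "neutral", "sad"])` with those example labels returns `"happy"`.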
OUTCOMES
- 01 Real-time inference at 8fps effective throughput on CPU
- 02 Multi-class classification across standard emotion categories
- 03 Sub-40ms forward pass latency per processed frame
STACK
Python · PyTorch · OpenCV · CNN · NumPy