Claims that AI solves every inspection problem are common in industrial settings, yet real factory environments are more complex. Most deployed systems rely on convolutional neural networks (CNNs), including YOLO variants, that examine small, localised image regions to identify defects.
While effective for obvious, isolated defects, these networks falter when facing intricate or spread-out patterns. Understanding how CNNs actually operate enables practitioners to avoid overconfidence and develop stronger testing methodologies.
Industrial AI relies on machine learning, and in particular on deep learning, which learns the relevant image features automatically instead of requiring hand-engineered ones. CNNs dominate image inspection tasks and power widely used models like YOLO.
CNNs slide multiple filters (kernels) across the input image. With a 3×3 kernel at stride 2 and suitable padding, each layer halves the spatial size: a 640×640 pixel image becomes 320×320, then 160×160, and after three such layers the internal feature map is 80×80. At the same time depth increases, with each channel representing a distinct detected pattern.
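The shrinking sizes can be checked with the standard convolution output-size formula, floor((n + 2p - k) / s) + 1. The sketch below assumes a padding of 1 pixel per side, which the text does not state but which is needed for 640 to halve cleanly. Both helper names are illustrative and not part of any library.

```python
def conv_output_size(n: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Spatial size after one convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


def feature_map_sizes(n: int, layers: int) -> list[int]:
    """Sizes after each of `layers` identical 3x3, stride-2 convolutions
    (padding of 1 is an assumption, see above)."""
    sizes = []
    for _ in range(layers):
        n = conv_output_size(n)
        sizes.append(n)
    return sizes


print(feature_map_sizes(640, 3))  # [320, 160, 80]
```

Without padding the first layer would give 319 rather than 320, so the exact numbers depend on how a given model is configured.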
Each kernel captures local neighbouring pixel information, functioning similarly to a compartmentalised inspection tray — it sees its own section clearly, but has no awareness of what's happening elsewhere in the image.
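How local "local" is can be estimated with the usual receptive-field recurrence: each layer widens the view by (kernel - 1) times the current spacing between units, and the spacing is then multiplied by the stride. The sketch below is a simplified calculation for a plain stack of convolutions. The function name is hypothetical, and real backbones such as those in YOLO have many more layers than the three used here.

```python
def receptive_field(layers: list[tuple[int, int]]) -> int:
    """Theoretical receptive field (input pixels, per axis) of one output unit.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # spacing, in input pixels, between adjacent units at this depth
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Three 3x3, stride-2 layers, as in the 640 -> 80 example.
print(receptive_field([(3, 2)] * 3))  # 15
```

Under these assumptions, each unit in the 80×80 map sees a 15×15 pixel window of a 640×640 image. Deeper networks see more, but each unit's view remains a bounded window rather than the whole image.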
Complications emerge when defects form complex patterns spanning broader regions, such as clusters of adjacent small defects or shapes whose meaning depends on context. To relate defects that lie far apart, the network must downsample heavily so that a single feature covers a wider area, which can sacrifice fine detail or merge nearby defects into one.
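The merging effect can be illustrated with plain integer division. The sketch below assumes a total downsampling factor of 8 (640 to 80) and treats each feature-map cell as owning an 8×8 block of input pixels. That is a simplification, since real receptive fields overlap, and the defect coordinates are made up.

```python
def feature_cell(x: int, y: int, stride: int = 8) -> tuple[int, int]:
    """Feature-map cell that input pixel (x, y) falls into."""
    return (x // stride, y // stride)


# Two hypothetical defect centres, 5 pixels apart in a 640x640 image.
defect_a = (200, 300)
defect_b = (205, 300)

print(feature_cell(*defect_a))  # (25, 37)
print(feature_cell(*defect_b))  # (25, 37)
```

Both defects land in the same cell, so at this resolution the network has to represent them with a single feature vector.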
This means a CNN that performs well in lab conditions with clean, isolated defects can fail systematically on the actual production line where real variability — dust, shadows, positional shifts, batch changes — is the norm, not the exception.
Engineers cannot depend exclusively on architecture claims. Reviewing training data — volume, diversity, balance, and particularly the confusion matrix — proves essential before trusting any AI inspection system in production.
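A small example shows why the confusion matrix matters more than a headline accuracy figure. The labels below are invented for illustration, and the helper functions are a minimal sketch using only the standard library rather than any particular vendor's tooling.

```python
from collections import Counter


def confusion_counts(y_true: list[str], y_pred: list[str]) -> Counter:
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))


def per_class_recall(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    """Fraction of each true class that the model labelled correctly."""
    counts = confusion_counts(y_true, y_pred)
    support = Counter(y_true)
    return {label: counts[(label, label)] / support[label] for label in sorted(support)}


# Made-up data: 8 good parts and 2 scratched parts, one scratch missed.
y_true = ["ok"] * 8 + ["scratch"] * 2
y_pred = ["ok"] * 8 + ["ok", "scratch"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                          # 0.9
print(per_class_recall(y_true, y_pred))  # {'ok': 1.0, 'scratch': 0.5}
```

In this toy case, 90% accuracy hides the fact that half of the scratches were missed, which is the kind of imbalance the per-class view exposes.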
Before deploying, ask the vendor: what defect types are in the training set? How was class balance handled? What happens with defects that appear at the edge of two inspection regions?
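The last question can be turned into a concrete check on labelled data. The sketch below assumes the inspection regions form a regular grid of square tiles (320 pixels here, chosen arbitrarily) and that defects are annotated as bounding boxes. Both are assumptions, and the function names are hypothetical.

```python
def tiles_touched(box: tuple[int, int, int, int], tile: int = 320) -> set[tuple[int, int]]:
    """Grid tiles (column, row) overlapped by a box given as
    (x_min, y_min, x_max, y_max) in inclusive pixel coordinates."""
    x_min, y_min, x_max, y_max = box
    cols = range(x_min // tile, x_max // tile + 1)
    rows = range(y_min // tile, y_max // tile + 1)
    return {(c, r) for c in cols for r in rows}


def straddles_boundary(box: tuple[int, int, int, int], tile: int = 320) -> bool:
    """True when a defect box spans more than one tile."""
    return len(tiles_touched(box, tile)) > 1


print(straddles_boundary((100, 100, 140, 130)))  # False
print(straddles_boundary((310, 100, 330, 130)))  # True
```

Counting how many annotated defects straddle a boundary, and how the system scores on that subset, gives a specific number to ask the vendor for.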
Upcoming editions will examine transformer-based vision models (specifically DETR) as an alternative architecture that aims to address these local-vision limitations.