Computer Vision for Object Identification, Counting & ERP Integration
By I Chishti · Mar 18 · Updated: Mar 30
Introduction: The Problem Nobody Told You Was Hard
Imagine the scenario: a warehouse full of equipment, assets, or inventory. You want an AI system that can look at that environment, identify every object, count them accurately, and push the result into your ERP, WMS, or asset management system — automatically, reliably, and without human error.

The business case is compelling. Manual stock counts are expensive, slow, error-prone, and often performed infrequently. A CV-based system that can count continuously or on-demand would transform inventory accuracy, reduce write-offs, and eliminate the cost of periodic physical audits.
But here is what most vendors won't tell you: getting from "AI identifies an object" to "ERP record is updated accurately" is a journey filled with subtle, non-obvious challenges. Challenges around duplication, occlusion, lighting, object similarity, movement, system integration, and — critically — knowing when to trust the machine and when to bring in a human.
This blog is the complete map. We will walk through every stage of a real-world deployment, the problems you will encounter, and how to solve them. It draws on Cluedo Tech's direct experience building exactly this type of system — including the failures, the fixes, and the decisions that made the difference.
Stage 1: Scoping — What Are You Actually Counting, and Why Does It Matter?
Before any camera goes up or model gets trained, you need ruthless clarity on the following:
What is the object taxonomy? Are you counting one type of object (e.g., pallets) or many (e.g., 400 SKUs of mixed equipment)? The more object types, the harder the problem. Object similarity — items that look nearly identical but are fundamentally different — is one of the leading causes of CV counting failure.
What is the environment? Indoor vs. outdoor. Controlled vs. variable lighting. Static objects vs. objects that move. Crowded vs. sparse arrangements. Each of these variables multiplies the challenge.
What counts as a "count"? This sounds obvious but is critically important. Does a partially visible object count? What about objects inside containers? Does a stack of three count as three or one? These rules must be defined before training, not after.
What does the ERP record need to contain? Object type, quantity, location, condition, timestamp, confidence score? Getting downstream data requirements defined upfront saves enormous rework later.
What is the acceptable error rate? In a pharmaceutical context, zero tolerance. In a general warehouse, ±2% may be perfectly acceptable. This drives your entire approach to confidence thresholds and human-in-the-loop design.
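These scoping decisions are worth pinning down as explicit configuration rather than tribal knowledge. The sketch below is purely illustrative — the field names and values are assumptions, not taken from any particular system — but it shows how the counting rules, ERP fields, and error tolerance can be captured in one reviewable place:

```python
from dataclasses import dataclass

# Illustrative only: field names and defaults are hypothetical examples
# of the scoping decisions described above, not a real system's schema.
@dataclass
class CountingSpec:
    taxonomy: list                           # object types in scope
    count_partially_visible: bool = False    # does a partly occluded object count?
    unpack_containers: bool = False          # count objects inside containers?
    count_stack_as_units: bool = True        # a stack of three counts as three
    erp_fields: tuple = ("object_type", "quantity", "location",
                         "condition", "timestamp", "confidence")
    max_error_rate: float = 0.02             # e.g. ±2% for a general warehouse

spec = CountingSpec(taxonomy=["pallet", "drum", "crate"])
```

Writing the rules down this way forces the "does a stack count as three or one?" conversation to happen before training, not after.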
Stage 2: The 360° First-Pass Strategy — Why Context Is Everything
One of the most important architectural decisions in a CV object counting system is whether the AI starts counting immediately, or whether it first builds a spatial understanding of the environment before counting begins.
Cluedo Tech's experience strongly supports the latter — and here is why.
The problem with naive scanning: If a camera or robot moves through a space and counts objects as it goes, it has no awareness of what it has already seen. An object visible from two angles gets counted twice. An object partially occluded from the first pass gets missed entirely. Without spatial context, the model is essentially guessing about what it has and hasn't counted.
The 360° context-first approach works as follows:
Spatial mapping pass (the "survey"): The camera system performs a complete 360° sweep of the environment — or a structured full-coverage scan — without counting. During this pass, the system builds a spatial map of the environment, identifying zones, boundaries, and the approximate locations of objects. Modern depth cameras (Intel RealSense, Microsoft Azure Kinect, LiDAR) make this dramatically more effective by providing 3D spatial data, not just 2D imagery.
Count pass (the "inventory"): The system then performs a structured, systematic scan. Because it now knows the spatial layout — where walls are, where objects are clustered, what zones exist — it can plan an optimal scanning path that ensures full coverage without revisiting the same object from multiple angles.
Reconciliation pass (the "verify"): Any objects flagged as uncertain, partially occluded, or below confidence threshold are then targeted specifically for a close-range, high-resolution capture.
This three-pass architecture dramatically reduces both over-counting (duplicates) and under-counting (misses) compared to a naive single-pass approach.
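The three passes can be sketched as a simple orchestration loop. Everything below is a stand-in — `survey_360`, `plan_coverage`, and `close_range` are hypothetical names for the real perception and motion APIs, and the stub classes exist only so the flow is runnable — but the structure mirrors the survey/inventory/verify sequence described above:

```python
def run_three_pass_count(scanner, detector, threshold=0.8):
    """Illustrative survey -> inventory -> verify flow.
    `scanner` and `detector` are stand-ins for real hardware/model APIs."""
    # Pass 1: survey -- build a spatial map, no counting yet
    spatial_map = scanner.survey_360()

    # Pass 2: inventory -- systematic scan along a planned coverage path
    detections = []
    for view in scanner.plan_coverage(spatial_map):
        detections.extend(detector.detect(view))

    # Pass 3: verify -- re-capture anything below the confidence threshold
    for det in [d for d in detections if d["confidence"] < threshold]:
        retake = detector.detect(scanner.close_range(det["position"]))
        if retake:
            det.update(max(retake, key=lambda r: r["confidence"]))

    return [d for d in detections if d["confidence"] >= threshold]


# Minimal stubs so the sketch runs end-to-end
class StubScanner:
    def survey_360(self): return {"zones": ["A"]}
    def plan_coverage(self, spatial_map): return ["viewA"]
    def close_range(self, position): return "close"

class StubDetector:
    def detect(self, view):
        if view == "close":  # close-range retake is higher confidence
            return [{"position": (0, 0), "confidence": 0.95}]
        return [{"position": (0, 0), "confidence": 0.6}]
```

Note the key property: nothing enters the final count until the verify pass has had a chance to resolve low-confidence detections.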
Stage 3: The Deduplication Problem — The Hardest Technical Challenge
Deduplication — ensuring each physical object is counted exactly once — is the single hardest problem in CV-based object counting. Here is why.
Sources of duplication:
The same object is visible from multiple camera positions or scan angles
An object moves slightly between scans (in a live environment)
An object is reflected in glass, metal surfaces, or water
Two similar objects are so close together that the model treats them as one, then re-evaluates and counts them separately
How to solve deduplication:
The most robust approach combines multiple techniques in a deduplication pipeline:
Spatial hashing / grid-based deduplication: The 3D environment is divided into a grid. When an object is identified, its estimated 3D position is logged against the grid. If a subsequent detection falls within the same grid cell as an existing detection, it is flagged as a potential duplicate rather than a new count. The confidence scores of both detections are compared and the higher-confidence detection is retained.
Object fingerprinting: Beyond position, objects can be "fingerprinted" using visual embeddings — a compact mathematical representation of what the object looks like. Two detections that share a similar fingerprint AND are close in space are very likely the same object. This is particularly powerful for distinguishing between "same object seen twice" and "two different objects that look similar."
Temporal tracking: In any environment where there is some movement (people walking past, forklifts, conveyor belts), temporal tracking using algorithms like DeepSORT or ByteTrack assigns each detected object a unique ID and tracks it across frames. An object that has been tracked continuously does not get re-counted when it is detected again.
Confidence-weighted reconciliation: Every detection carries a confidence score. When the deduplication pipeline flags two detections as potentially the same object, the system retains the higher-confidence detection and archives the lower-confidence one. Final count totals are only derived after the full reconciliation process has run.
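To make the pipeline concrete, here is a minimal sketch of the spatial-plus-visual rule: two detections are merged only when they fall in the same grid cell AND their embeddings are similar, and the higher-confidence one wins. This is a toy version under simplifying assumptions (real embeddings have hundreds of dimensions, and objects spanning a cell boundary would need neighbour-cell checks), not production code:

```python
import math

def grid_cell(pos, cell_size=0.5):
    """Map a 3D position to a coarse grid cell for spatial hashing."""
    return tuple(int(c // cell_size) for c in pos)

def cosine_sim(a, b):
    """Similarity between two visual embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def deduplicate(detections, cell_size=0.5, sim_threshold=0.9):
    """Merge detections that are close in space AND similar in appearance,
    keeping the higher-confidence one. Visually different detections in the
    same cell are kept as separate objects."""
    cells = {}  # grid cell -> list of kept detections
    for det in detections:
        bucket = cells.setdefault(grid_cell(det["position"], cell_size), [])
        match = next((k for k in bucket
                      if cosine_sim(det["embedding"], k["embedding"]) >= sim_threshold),
                     None)
        if match is None:
            bucket.append(det)            # new object
        elif det["confidence"] > match["confidence"]:
            bucket[bucket.index(match)] = det  # same object, higher confidence wins
    return [d for b in cells.values() for d in b]
```

The "AND" is the important part: position alone would merge two genuinely different objects sitting side by side, and appearance alone would merge identical SKUs across the warehouse.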
Stage 4: Human in the Loop — When to Trust the Machine, and When Not To
No CV system should operate in a fully autonomous mode at launch. Human-in-the-loop (HITL) design is not a concession — it is the architecture that makes the system trustworthy, improvable, and deployable.
Where humans add value:
Confidence thresholding: Any detection below a defined confidence threshold (e.g., below 80%) is routed to a human reviewer rather than being automatically counted or dismissed. The human reviews the image crop, makes the call, and the decision is logged — both as a correction to the current count and as a training signal for the model.
Ambiguous object classes: When the model is uncertain between two similar object types (e.g., two equipment variants that differ only in a small label), a human reviewer confirms the classification. Over time, these ambiguous cases become the most valuable training data.
Exception handling: Novel objects — things the model has never seen before — should trigger an alert rather than being silently ignored. A human reviews, classifies, and the new object type is added to the taxonomy.
Audit-on-demand: Even after the system has established high confidence, random sampling by human auditors should be maintained. This is critical for trust, for compliance, and for catching drift before it becomes a problem.
The HITL feedback loop: Every human correction should feed directly back into model retraining. This is the mechanism by which the system gets smarter over time. In Cluedo Tech's implementations, HITL feedback loops typically reduce the rate of human intervention by 60–80% within the first 90 days of live operation, as the model is continuously improved on real-world edge cases.
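The routing logic itself is small — the value is in the discipline around it. A minimal sketch (the data shapes here are hypothetical): every detection either auto-counts or goes to a review queue, and every human decision is logged twice — once as a count correction, once as a retraining example:

```python
REVIEW_THRESHOLD = 0.80  # detections below this go to a human reviewer

def route_detection(det, auto_counts, review_queue):
    """Auto-count confident detections; queue the rest for human review."""
    if det["confidence"] >= REVIEW_THRESHOLD:
        auto_counts.append(det)
    else:
        review_queue.append(det)

def apply_human_decision(det, accepted, auto_counts, training_log):
    """A reviewer's call corrects the count AND becomes a training signal."""
    training_log.append({"image": det["image"], "label": accepted})
    if accepted:
        auto_counts.append(det)
```

The `training_log` is the feedback loop: without it, the review queue is a permanent cost; with it, the queue shrinks as the model learns the edge cases.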
Stage 5: Accuracy — What to Expect, and How to Measure It Properly
Setting realistic accuracy expectations upfront protects everyone. Here is what the real-world numbers look like:
| Scenario | Realistic Accuracy at Launch | Realistic Accuracy at 90 Days |
|---|---|---|
| Single object type, controlled environment | 96–99% | 99%+ |
| Multiple object types, controlled lighting | 88–95% | 95–98% |
| Multiple object types, variable environment | 78–88% | 88–95% |
| Highly similar objects, crowded environment | 70–82% | 82–92% |
How to measure accuracy correctly:
Do not measure against the AI's own output. Ground truth must come from an independent physical count of a sample. The most rigorous approach:
Run the AI count on a defined area
Have a human physically count the same area immediately after (blind to the AI result)
Compare and categorise discrepancies: missed detections (false negatives), duplicate counts (false positives), and misclassifications
Calculate precision, recall, and F1 score — not just overall "accuracy," which can be misleading
The key metrics to track:
Precision: Of everything the AI said was an object, what percentage actually was? (Measures false positive rate)
Recall: Of all the objects that were actually there, what percentage did the AI find? (Measures false negative / miss rate)
F1 Score: The harmonic mean of precision and recall — the single best summary metric
Confidence calibration: Does a confidence score of 90% actually correspond to 90% accuracy? Well-calibrated models are essential for HITL thresholding to work correctly
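The three headline metrics fall straight out of the discrepancy categories from the ground-truth comparison. A worked example, using illustrative numbers:

```python
def count_metrics(true_positives, false_positives, false_negatives):
    """Precision, recall, and F1 from a ground-truth comparison.
    TP: correct detections; FP: duplicates/phantoms; FN: missed objects."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 95 correct detections, 5 duplicates/phantoms, 5 missed objects
p, r, f1 = count_metrics(95, 5, 5)
# precision = 0.95, recall = 0.95, F1 = 0.95
```

Note that a system could hit 95% "accuracy" by over-counting and under-counting in roughly equal measure — precision and recall expose that, a single accuracy number hides it.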
Stage 6: ERP and Systems Integration — The Last Mile
Getting an accurate count is half the battle. Getting that count into your ERP, WMS, or asset management system — reliably, without duplication, and with full auditability — is the other half.
Key integration architecture decisions:
Synchronous vs. asynchronous: For real-time stock systems, you may want counts pushed immediately as detected. For end-of-shift inventory reconciliation, a batch sync after the count is complete is more appropriate and easier to manage.
Idempotency: Every record pushed to the ERP must be designed to be idempotent — meaning if the same record is pushed twice (due to network failure, retry logic, or system glitch), it does not create a duplicate entry. This is achieved through unique transaction IDs tied to each count event.
Confidence scores in the ERP record: Cluedo Tech's recommendation is to push confidence scores alongside count data into the ERP. This allows downstream systems and human reviewers to understand the certainty level of each entry and prioritise review of low-confidence records.
Staging layer: Rather than pushing directly to the live ERP, counts are first pushed to a staging layer (a review queue). A lightweight human review step confirms the staged data before it is committed to the production ERP. In Cluedo Tech's deployments, this staging review typically takes 5–15 minutes per session and catches the vast majority of remaining errors before they enter the system of record.
Full audit trail: Every count, every detection, every human correction, and every ERP write should be logged with a timestamp, a user or system ID, and the source image. This is non-negotiable for compliance, for debugging, and for the model improvement pipeline.
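The idempotency and staging ideas combine naturally: derive a deterministic transaction ID from the count event, so a retried push is a no-op, and only commit to the ERP after human review. The sketch below is an in-memory toy — a real staging layer would be backed by a database, and the record fields are hypothetical:

```python
import hashlib

class StagingQueue:
    """Toy sketch of an idempotent staging layer in front of the ERP.
    Real systems would persist this in a database, not a dict."""
    def __init__(self):
        self.staged = {}     # transaction_id -> record awaiting review
        self.committed = {}  # transaction_id -> record written to ERP

    def stage(self, record):
        # Deterministic transaction ID from the count event: pushing the
        # same record twice (retry, network glitch) cannot create a duplicate.
        key = f"{record['count_event_id']}:{record['object_type']}"
        txn_id = hashlib.sha256(key.encode()).hexdigest()
        self.staged.setdefault(txn_id, record)  # no-op on a repeat push
        return txn_id

    def commit(self, txn_id):
        """Called only after human review confirms the staged record."""
        if txn_id in self.staged and txn_id not in self.committed:
            self.committed[txn_id] = self.staged[txn_id]
```

The point of the deterministic ID (rather than a random UUID per push) is that idempotency survives retries at any layer — the ID is a property of the count event, not of the network call.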
Stage 7: Key Things to Watch Out For — Lessons From the Field
These are the non-obvious pitfalls that repeatedly catch teams out:
Lighting drift: Natural light changes throughout the day. A model trained on morning conditions may perform significantly worse in afternoon shadows. Either use controlled artificial lighting, or train across multiple lighting conditions explicitly.
Object arrangement changes: If objects are typically stored in a specific way and an operator changes the arrangement, the model may fail. Build tolerance for arrangement variation into your training data from the start.
Specular surfaces and reflections: Shiny objects, metallic surfaces, and glass create false detections. Polarised lenses and careful lighting angles mitigate this — it is a hardware solution as much as a software one.
Label occlusion: If your taxonomy relies on reading labels or barcodes, even partial label occlusion will cause misclassification. Design the counting process so that the most label-readable angle is captured deliberately.
Model drift: A model that performs at 95% at launch may degrade over time as the environment evolves (new object types appear, objects are arranged differently, the space changes). Continuous evaluation against ground truth samples and scheduled retraining are essential.
Integration latency and ERP throttling: ERP systems often have API rate limits. A CV system that produces thousands of records per minute needs a proper queue and rate-limiting layer between it and the ERP — otherwise you will hit API limits, records will be dropped, and data integrity will be compromised.
Change management: The biggest non-technical failure mode. If the people using the ERP don't trust the AI count, they will manually override it — and you will end up with worse data than before. Invest heavily in transparency, accuracy reporting, and demonstrating the system's value to end users.
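On the throttling point above: the fix is a buffer that absorbs detection bursts and releases records at a pace the ERP accepts. A minimal sketch (the class and its rate parameter are illustrative, not a real ERP client):

```python
import time
from collections import deque

class RateLimitedPusher:
    """Toy sketch: buffer records and release them at no more than
    `rate_per_sec` pushes per second, so ERP API limits are respected
    and no record is dropped during a detection burst."""
    def __init__(self, rate_per_sec, push_fn):
        self.min_interval = 1.0 / rate_per_sec
        self.push_fn = push_fn       # whatever actually writes to the ERP
        self.queue = deque()
        self._last_push = 0.0

    def enqueue(self, record):
        self.queue.append(record)    # absorbing a burst costs nothing here

    def drain(self):
        while self.queue:
            wait = self.min_interval - (time.monotonic() - self._last_push)
            if wait > 0:
                time.sleep(wait)     # pace the ERP writes
            self.push_fn(self.queue.popleft())
            self._last_push = time.monotonic()
```

In production the queue would be durable (e.g. a message broker) so records survive a crash, but the principle is the same: decouple detection rate from ERP write rate.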
How Cluedo Tech Built This
Cluedo Tech has built and deployed exactly this type of system — computer vision-based object identification, counting, deduplication, and ERP integration — in a real operational environment.
Our approach centred on several principles that proved essential:
Context before counting. We implemented the 360° survey pass before any counting began. This alone reduced duplicate detections by over 70% compared to our initial naive scanning approach during testing.
Spatial-visual deduplication. We combined 3D positional data with visual embedding fingerprinting. Two detections had to be similar in both space and appearance to be flagged as duplicates — this prevented the system from incorrectly merging objects that were close together but genuinely different.
Staged ERP integration. We deliberately did not push directly to the live ERP during the first phase. Every count went to a human-reviewable staging layer. This built trust, caught edge cases, and created the training data that rapidly improved the model.
Confidence-first HITL design. Every detection below 85% confidence went to a human reviewer. In the first two weeks, approximately 25% of detections required human review. By week 12, that figure was under 4% — driven by a continuous model improvement cycle fed by every human correction.
Transparent accuracy reporting. We built a live accuracy dashboard visible to all stakeholders — showing precision, recall, and F1 score by object type, updated daily. This transparency was the single most important factor in building organisational trust in the system.
The system is now operating with greater than 94% end-to-end accuracy on a complex, multi-object-type environment — with a clear trajectory toward 97%+ as the model continues to improve on real-world data.
Conclusion: A Checklist for Your Deployment
Before you begin, ensure you have answers to every one of these:
Is the object taxonomy fully defined, including edge cases?
Is the acceptable error rate agreed with stakeholders?
Is the environment characterised — lighting, surfaces, arrangement patterns?
Is depth/3D spatial data being captured, not just 2D imagery?
Is a 360° context-first approach built into the scan architecture?
Is the deduplication pipeline designed (spatial + visual + temporal)?
Is HITL thresholding defined and the review workflow built?
Is the HITL feedback loop connected to model retraining?
Is the ERP integration idempotent, with a staging layer?
Is a full audit trail in place?
Is accuracy being measured independently against ground truth?
Is there a change management plan for end users?
Get all twelve of these right, and you have the foundation for a CV object counting system that will genuinely transform how your organisation manages physical assets and inventory.
Cluedo Tech has built this. We can help you build it too. Request a meeting.



