This paper is written for the people who carry technical responsibility for what their organisations decide on the rocks. Chief geologists, technical directors, exploration leads, JV technical reviewers, resource modellers, the data engineers who support them, and the geologists who run all of the above by themselves at smaller firms.

It walks through what mining-grade data normalisation actually entails when the corpus is real — Cyrillic, decades-old, multi-jurisdictional, structured for a regulatory regime that no longer exists — and what it takes for that work to become genuinely usable to a working geologist. The two are not the same problem. Most platforms underestimate the second one.

It uses concrete examples drawn from live engagements, anonymised under client agreements. It is a field report. The companion paper — The Trust Layer: From Co-Pilot to Autopilot in Mining Finance — covers the same architecture for the capital allocators that geological practitioners serve.


The four numbers that carry everything

Picture a single drill intercept. A grade, an interval, a depth, a coordinate. The four numbers that, between them, tell a working geologist almost everything that matters about a single line in a corpus.

Now picture the same intercept the way it actually arrived at our pipeline last month. The grade is reported in volumetric units — not g/t, not %, not ppm — because the report is from the late 1970s and the convention for that placer assemblage in that country at that time was volumetric. The interval is not a sampled interval at all; it is a horizon-average expressed as a range, an arbitrary scale, not a true drill intercept. The depth is referenced to a collar elevation that is itself referenced to a regional datum that was current then but is not current now. And the coordinate is referenced to the Pulkovo Meridian, the geographical reference of the Russian Empire from 1844 onwards, which sits 30°19′34″ east of Greenwich.

Take that single intercept into a modern resource model and every one of those four numbers needs to be transformed before the model will accept it. Not converted — transformed. The unit conversion needs a custom constant for the volumetric placer assemblage. The interval requires a deliberate decision about whether to treat the horizon-average as a proxy for a sampled interval or to flag it as something else. The depth requires datum reconstruction. The coordinate requires a Pulkovo → WGS84 transformation. None of these transformations is impossible. Each of them is a place where a careless team will silently introduce a thirty-degree-of-longitude error.

The story above is not unusual. It is the median morning at any data team that is trying to ingest a real corpus of historical and multilingual mining disclosures. Most off-the-shelf platforms were built around the cleanly-disclosed quarterly figures of major operators in the major reporting jurisdictions. The rest of the industry has been left to fend for itself.

This paper is about the rest of the industry. The Soviet-Cyrillic corpus is the hardest case we have worked, and we lean on it throughout because pushing past 99% on Cyrillic, multi-format, mid-twentieth-century technical reports earns the right to be believed on anything easier. The same architecture runs at comparable performance in any language — English, Spanish, Portuguese, French, and other major resource-language corpora.

AI Readiness Diagnostic

Where does your team's data infrastructure sit today?

Answer 10 questions. Get a private diagnostic on your AI readiness — in minutes.

The five normalisation disciplines

Each of the five disciplines below is written in working language, with concrete examples from live engagements. Read them in order and a pattern comes into focus: every one is solvable, none of them is solvable casually, and the cumulative cost of solving them all by hand is what makes internal builds stall at the eighteen-month mark.

2.1 — Units of measure

The most basic problem and the one most often dismissed. Grade is reported in g/t, %, ppm, oz/t depending on commodity, jurisdiction, era, and the personal preference of the technical author. A unit error at the front-end becomes a deal error at the back-end.

During a recent platform review with the technical team at a tier-one royalty client, the conversation moved into the front-end of the comps interface. The team’s technical lead stayed on units — ounces-per-ton versus grams-per-ton — long after the rest of the conversation had moved on. They were right to stay on it. At the decision-grade scale, a unit error is structural; a careful interface that surfaces both the original disclosed value and the normalised value side by side is not a polished item. It is a defence.

Some of the harder unit problems we have built explicit handlers for:

Pulse holds the original disclosed value alongside the normalised value at every layer of the platform. The original is never overwritten. The normalised is never presented without the original being one click away.

2.2 — Currency and economic conventions

Currency normalisation looks trivial until it shows up. An Australian operator’s annual report frequently carries multiple currency references inside a single document: head numbers in AUD, segment breakouts in USD, occasionally a divestment line in CAD or ZAR. Disambiguation requires page-level context, not a document-level assumption.

Soviet-era cost reporting in roubles requires inflation-adjusted reconstruction before any modern peer comparison is meaningful. The rouble has been redenominated multiple times across the relevant period, and the inflation indices used by Soviet authorities are not directly comparable to modern equivalents.

Royalty and streaming economics introduce their own definitional challenges. NSR (net smelter return), GRR (gross revenue royalty), NPI (net profits interest), and percentage-of-gross variants are not interchangeable. A peer comparison that mixes them is not a peer comparison; it is a misleading bar chart.

2.3 — Reporting standard reconciliation

JORC, NI 43-101, SAMREC, S-K 1300, and historical Soviet GKZ are not interchangeable. Each defines categories — measured, indicated, inferred, and their pre-modern equivalents — with subtly different criteria and confidence thresholds. A reserve restated under a new regime is genuinely different from the previous figure, even when the underlying geology has not changed.

The Soviet GKZ classification deserves particular attention. The categories — A, B, C1, C2, P1, P2, P3 — are sometimes mapped one-to-one against JORC categories. They should not be. The confidence model under GKZ was built around drilling density and geological assurance criteria that do not align with modern definitions. Inferred-equivalent categories are mappable; equivalence is not equivalence.

Pulse maintains an explicit reconciliation layer for these mappings, governed by domain experts and updated as standards evolve. The layer is itself a living artefact, not a one-time spec, because the standards evolve, jurisdictions issue new guidance, and operators interpret guidance unevenly.

2.4 — Georeferencing and place-name collisions

The Pulkovo Meridian

The Pulkovo Observatory, founded outside Saint Petersburg in 1839, sat at the centre of the Russian Empire’s scientific cartography from the mid-nineteenth century. Its meridian — 30°19′34″ east of Greenwich — was adopted in 1844 as the geographical reference for all Russian mapping, and remained in use through the Soviet period.

Without an explicit Pulkovo → WGS84 transformation, an asset can sit roughly thirty degrees of longitude away from where the modern map says it sits. This places assets in the wrong country. It produces geographically incoherent drainage analyses. It defeats every downstream geospatial workflow until it is corrected.

Soviet decree time

By Sovnarkom decree of 16 June 1930, all clocks in the Soviet Union were permanently shifted one hour ahead from 21 June 1930 onwards. Cross-comparing time-stamped Soviet records against international datasets without correcting for decree time produces a quiet temporal drift that compounds across the period.

Place-name collisions

Dozens of villages, regions, and rayons share names across the former Soviet republics. “Spasskoye” is everywhere. Disambiguation requires regional context, oblast metadata, neighbouring named features, and sometimes historical administrative-boundary reconstruction. A team that resolves place names by string match alone will conflate three deposits into one regularly enough that the conflation becomes a structural error in the resource model.

Latin-character map IDs in Cyrillic-only OCR contexts

A recurring extraction failure mode. Soviet-era OCR was Russian-only. Map IDs and reference labels frequently use Latin characters that the OCR drops or misrecognises. The fix is a bilingual OCR pipeline with explicit Latin-character handling, not a post-hoc string-matching layer. We learned this the hard way; the fix is in production.

2.5 — Measurement conventions across 150 years

Mathematical conventions and statistical treatments evolve. Outlier handling, capping, declustering, top-cut conventions, and the philosophical position of how much of a high grade to believe are not historically constant. A historical resource estimate is not just a number; it is a number that was produced by a method which itself has a date stamp.


The 99% threshold

Most extraction vendors quote a required ~90% accuracy threshold. The threshold is meaningless.

Ninety percent extraction accuracy on a technical corpus is not usable in day-to-day investment workflows. Errors at the 90% level cannot be filtered — there is no signal that distinguishes the correct rows from the incorrect rows without re-verifying every cell. A team running on 90% extraction has effectively done none of the validation work; they have done a fast pull and inherited the burden of cell-by-cell verification on every row that matters.

Mining-grade extraction has to push past 99% on the metrics that drive decisions, and it has to do so on real corpora — not on the cleanly-printed examples that vendors choose for their benchmarks.

See it on your data

Ready to run this on your own corpus?

We run proof-of-concept engagements on real data. Scoped, time-bound, and benchmarked against your own ground truth.

A worked example, anonymised

A leading global producer commissioned a pilot on six complex Soviet-era geological reports drawn from a Central Asian historic exploration archive. The corpus was Cyrillic, multi-format, and historical — exactly the kind of corpus an off-the-shelf toolchain quietly fails on.

Six reports. Seventy-one Priority-1 documents. One hundred and sixty-four input files in total. A blend of drill-hole passports, well geological-geophysical columns, well cross-sections, stratigraphic correlation panels, and physical-properties sheets, each with its own structural conventions and extraction challenges.

The hardest documents — single drill-log passports — carried up to 20,000 data points apiece, on the order of 2,000 rows by 10 columns, at variable scan quality. The producer’s scoring framework allocated 20% of the total weight to assay extraction, the most technically difficult item.

99.21%
Element accuracy on scored ground-truth corpus
11,249
Drill-hole assay rows extracted across 37 drill-log images
0
Hard processing failures across 164 input files

The numbers above were produced by combining computer vision, classical OCR (Tesseract), LLM-based extraction, and statistical reconciliation. No single technique would have reached 99%+ alone. The combination, validated through a benchmarking framework with fifty to two hundred hard samples per pipeline step and regression tested on every change, is what made the result reproducible rather than lucky.

On the same corpus the pipeline extracted 11,249 drill-hole assay rows across 37 drill-log images, covering 34 distinct geochemical elements (with majors Cu, Mo, Pb, Zn, Ag), and enriched 96.3% of those rows with full lithology. Zero hard processing failures across 164 input files.

A vendor that says “99%” without describing the corpus, the metric, and the validation methodology is selling marketing. A vendor that says “Cu accuracy is 99.21% on this scored ground-truth subset, lithology accuracy is in the low-to-mid 90s by internal benchmark, multi-hole stratigraphic panels are deliberately scoped out and require a dedicated handler” is being honest. Honesty is what survives a JV technical review.


The three load-bearing stages for geological practitioners

Occurrence-level entity resolution

The unit of resolution that matters in a real corpus is not the document and is not the project. It is the occurrence — the named geological feature, mineralisation, or deposit referenced in a report. A single occurrence may be referenced under multiple names, in multiple languages, across hundreds of documents covering decades.

A team that does not deduplicate at occurrence level cannot build a chronological history of any given deposit; the same deposit appears as three different entries in the database. Across one frontier-exploration corpus, occurrences were deduplicated and cross-linked across roughly fifteen thousand document bundles, enabling queries of the shape: “Show me every gold grade above 1 g/t associated with [named occurrence] across every report in which it is mentioned, in chronological order.”

Grade-to-occurrence mapping

Document-level grade extraction is a starting point. Decision-grade work needs grade extraction mapped to the specific named occurrence the grade describes — not just to the parent document. The relational layer has to be maintained explicitly: a single grade record carrying original disclosed value, normalised value, sample type, depth and interval, source page reference, occurrence ID, and any lineage from a contradicting disclosure.

Human-in-the-loop, by design

When a contradiction surfaces between two disclosures of the same occurrence, a working geologist resolves it, and the resolution becomes a rule. Future extractions of the same pattern are governed by the rule rather than re-litigated. Over the life of an engagement, the system accumulates the team’s judgement; the team accumulates capacity, because the same problems do not have to be solved twice.

This is the architectural reason internal builds stall: the loop only spins at volume, and most teams do not see enough disclosures to feed it.


Geospatial work: where the data has to land

Three questions, all load-bearing: when a coordinate is stated, is it correct after transformation? When a coordinate is not stated, where is the asset? And once the answer to either is in hand, does it land in the GIS toolset the geologist already works in?

When a coordinate is stated: Normalise it. Pulkovo → WGS84. Krasovsky 1940 → WGS84. Local datum → WGS84. Always preserve the original. Surface the transformation logic. Production-grade georeferencing requires a fallback chain, a confidence-level reporting layer, and explicit flagging of any coordinate the system could not transform with high confidence.

When a coordinate is not stated: Infer it carefully. Trained inference can locate an asset from the textual content of disclosures — regional context, deposit-type cues, named geographic features, drainage references. The inference returns a confidence interval, not an invented coordinate. A high-confidence inference is treated differently from a low-confidence one in every downstream workflow.

Where the coordinate goes next: The geologist works in ArcGIS, QGIS, LeapFrog, or some combination of the above. The expectation, fairly, is that mining-grade data plugs into the toolset they already trust.

Pulse outputs are designed to integrate, not to displace. Coordinate sets export as shapefiles, GeoJSON, or KML. Drill collars export as point layers with their full validated attribute table — grade, interval, sample type, source page, occurrence ID. In ArcGIS or QGIS or LeapFrog, the validated data arrives in the format the geologist already understands.


The five steps from corpus to working geologist

There is a gap between “data is ready” and “geologists can use it.” It is its own discipline. Most platforms do not address it because most platforms were designed by data engineers for data engineers.

A working geologist put it directly in a recent scoping conversation: being geologists, not IT specialists, what they were struggling with most was not the underlying data work — it was the interface of how to actually access the AI-translated documents and visualise the results spatially.

Step 1 — Translate. Full-corpus OCR with structural reconstruction. Bilingual outputs (original alongside English). Translation that respects technical terminology. “Redkozemelne elementy” translates as “rare earth metals,” not as “rare ground earths.”

Step 2 — Search. Hybrid search across keyword and semantic layers, with citations to the page and table cell that produced the answer. Cross-language queries handled correctly. Misspellings of Russian or place names returning results, not zero hits.

Step 3 — Extract. 99%+ accuracy on the metrics that drive decisions, with the original disclosed value preserved alongside the normalised one, sample-type tagging, outlier flagging, and source-line lineage on every cell.

Step 4 — Spatially link. Drill holes as points. Maps as outlines with attribute tables. Reports as texts that reference both. All three representations linked — a drill hole at one location must be findable as a coordinate on a map and as the named hole inside the report.

Step 5 — Visualise and query in the GIS toolset. Shapefiles, GeoJSON, KML, raster GeoTIFFs, all carrying the validated attribute tables. Spatial queries work natively against the layer. Filters by commodity, grade range, occurrence, source document, time period — all available without leaving the GIS environment.


Three archetypes from the field

Archetype A — a frontier exploration archive in Central Asia

A late-stage exploration project working a Soviet-era archive across an asset of national-scale strategic importance. Approximately fifteen thousand document bundles — around three million pages — with 537,000 attachments, of which 320,000 were image-format. Russian-first; the team’s own internal toolset until then had been a script-based fuzzy keyword search.

The foundation was rebuilt from the ground up. Full-corpus OCR with structural reconstruction. Bilingual outputs. A structured data layer underneath the text. One hundred elements normalised across multi-language naming conventions, including Soviet-specific terminology. Approximately 950,000 grade records extracted with sample-type tagging and outlier flagging. Full corpus end-to-end processing time after pipeline optimisation: approximately forty-eight hours.

Archetype B — a tier-one global producer’s pilot

The pilot anchored the worked example above. The headline numbers bear repeating: 71 documents, 11,249 drill-hole assay rows, 34 distinct geochemical elements, ~97% word / ~99% character OCR accuracy on Russian text including handwritten, stamped, and degraded scans, 509 tables extracted to structured CSV. On the scored ground-truth corpus: 99.21% element accuracy, 99.65% recall, 99.06% precision, 99.71% interval-length equality. Zero hard processing failures across 164 input files.

Archetype C — a frontier copper-belt junior in active POC scoping

A junior exploration team holding Soviet-era licence ground in a frontier jurisdiction. Five concurrent problems articulated during scoping: translation of Soviet reports at scale; AI-assisted natural-language search across the translated corpus; recognition and extraction of drilling data; spatial linkage between drillholes, map outlines, and the reports they reference; and spatial search natively in ArcGIS.

The team’s honest framing: as geologists, not IT specialists, the hard part was not the underlying data engineering. It was the interface where the AI-translated documents and the spatial visualisation met their workflow.

The POC scope settled on a deliberately-limited subset of the team’s portfolio — around five percent of the documents — with the goal of running every step end-to-end on a high-quality output. The principle: do less, do it well, surface the edge cases on a small corpus before scaling.


What the alternatives actually do

Generic OCR is not a technical-document OCR. Basic OCR is the equivalent of handing a 600-page feasibility study to a junior intern on their first day. The output is roughly 5% usable text on a real technical corpus. Advanced OCR with structural reconstruction sits at 97%+ on Russian text including handwritten, stamped, and degraded scans, in production today. The difference is not incremental; it is the difference between a system the AI can understand and a system it cannot.

Generic LLMs hallucinate technical figures. A defensible architecture uses language models downstream of the truth layer — for summarisation, conversational interfaces, and answer assembly — after structured data has been validated. Truth is extracted by deterministic algorithmic logic. The model is the brain; the trust layer is the memory; the trust layer must be built.

Generic enterprise AI inside Office 365 isn’t this. Copilot is a competent conversational layer over text. It is not, however, a substitute for the disciplines this paper has walked through. Copilot does not maintain a definitional ontology for mining reporting standards. It does not reconcile JORC ↔ NI 43-101 ↔ GKZ. It does not transform Pulkovo coordinates to WGS84. It does not preserve the original disclosed value alongside a normalised one with the transformation logic visible. Anyone who claims otherwise has either not tried it on a real Soviet-era technical report or has not yet been audited on the result.

QGIS, LeapFrog, ArcGIS, Adobe, Excel are all excellent for what they were built for. None is a substitute for a data platform with a domain ontology, a validation layer, and a freshness loop. The right relationship between Pulse and these tools is integration, not replacement.

Eight questions

Is your data infrastructure decision-grade?

Three or more "no" answers indicate infrastructure the corpus has outgrown. Take the diagnostic to find out where your team sits.

Conclusion

Geology is a discipline of careful inference from incomplete data. The geologist takes drill intercepts, structural measurements, geochemical signatures, and a modelled understanding of how the deposit formed, and produces an estimate of what is there. The estimate is defensible because the method is rigorous, the assumptions are explicit, and the data is curated.

Geological data normalisation is the same discipline applied one layer below — to the data the geology itself is built on. A drill intercept is itself a derived figure; behind it sits a sampling decision, an instrument, a laboratory, a unit convention, an interval definition, a coordinate, an era. A team that normalises that derived figure with the same rigour the geologist applies to the inference is doing the work the corpus deserves.

Above that data layer sits the second discipline this paper has named: making the result genuinely usable to a working geologist. Solving the data layer without solving the workflow layer leaves the geologist with a queryable corpus they cannot operate. Solving the workflow layer without the data layer leaves them with a usable interface over information they cannot trust. Both have to be done.

This paper has been a long answer to a short question: what does it actually take to do that work well, at scale, on real data? The short answer is two years of blood, sweat, and tears, and an architecture that compounds rather than degrades. The shortest answer of all is that, eventually, the demonstration of the work running on real data lands faster than any explanation of it. That is the moment the pitch stops. From there it is just the work.

Pulse Intelligence

Less Searching. More Strategising.™

See the platform running on real geological data. Scoped proof-of-concept engagements available — benchmarked against your own ground truth.

Companion Paper · Investing Edition
The Trust Layer: From Co-Pilot to Autopilot in Mining Finance
The same architecture for the geological practitioners who serve the same firms.
WC
Will Coetzer
Founder and CEO, Pulse Intelligence
Pulse Intelligence is an AI infrastructure company that replaces manual technical and investment workflows in capital-intensive industries, starting with mining. The platform automates the data extraction, validation, normalisation, and structuring work that technical teams and analysts currently do by hand — with full source-line lineage on every figure.

Platform scale (live): 4,144+ companies tracked · ~14,000 mining assets monitored daily · 1.6M+ publications ingested · 16 stock exchange integrations · 34 commodities tracked daily · Multilingual processing including Cyrillic and Mandarin.