Technical Methodology

The Lead Detection Tool combines two complementary machine learning models to detect informal sources of lead contamination across South Asia.

Two Complementary Pipelines

Our screening approach integrates contextual analysis with remote sensing to maximize geographic coverage while maintaining detection accuracy. The two models operate in parallel and can be deployed independently based on data availability and regional priorities.

Geospatial Data → Contextual Model (XGBoost) → ULAB Detection

Sentinel-2 Imagery → Satellite Model (Deep Learning) → Smelter Detection

Both pipelines output a risk score (0.0-1.0) and confidence metrics that feed into the final prioritization engine.

Contextual Model: ULAB Detection

Overview

The Contextual Model identifies informal used lead-acid battery (ULAB) recycling operations using an XGBoost classifier trained on geospatial, socioeconomic, and environmental features. ULAB sites are characterized by clusters of battery shops, scrap dealers, and informal settlements that lack environmental controls.

Algorithm

XGBoost (eXtreme Gradient Boosting) is a gradient-boosted decision tree ensemble that excels at capturing non-linear relationships in structured data. The model is trained to output a probability score for each 1 km² grid cell indicating the likelihood of ULAB activity.

        Model: XGBoost (gradient boosted trees)

        Input: Geospatial features per 1km² grid cell (POI density, population, road networks, land use)

        Output: Risk probability 0.0–1.0 per cell

        Proof of concept: NCR Delhi (13 verified ULAB sites as training labels, 3,746 grid cells scanned)
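As a rough illustration of the 1 km² gridding described above, the sketch below assigns a latitude/longitude point to a grid cell. The helper name `cell_index`, the choice of a south-west grid origin, and the equirectangular approximation are all assumptions made for illustration; the source does not specify how the grid is constructed.

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude, in km


def cell_index(lat, lon, origin_lat, origin_lon, cell_km=1.0):
    """Return (row, col) of the grid cell containing (lat, lon).

    Hypothetical helper: uses an equirectangular approximation around the
    grid origin (assumed to be the south-west corner), which is adequate
    for a city-scale grid but not for continental extents.
    """
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(origin_lat))
    north_km = (lat - origin_lat) * KM_PER_DEG_LAT
    east_km = (lon - origin_lon) * km_per_deg_lon
    return math.floor(north_km / cell_km), math.floor(east_km / cell_km)
```

Each feature described in the next section would then be aggregated per `(row, col)` cell before being passed to the classifier.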

Feature Engineering

The model incorporates the following feature categories:

◆ Business Clustering

Density of battery shops, scrap metal dealers, and informal recycling businesses within a 500 m radius. Derived from OpenStreetMap POI data and the Google Places API.

◆ Population Demographics

Population density, presence of informal settlements, and presence of schools and hospitals. Sources: WorldPop, settlement layers from OpenStreetMap.

◆ Road Networks & Accessibility

Road density, distance to major roads and highways, proximity to transport corridors. Derived from OpenStreetMap road network data.

◆ Land Use Classification

Industrial vs. residential vs. mixed-use land classification, proximity to industrial zones. Source: OpenStreetMap land use polygons.
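To make the business clustering feature concrete, here is a minimal sketch that counts points of interest within 500 m of a location using the haversine distance. The function names and the `(lat, lon)` tuple layout are hypothetical; the actual feature pipeline and its data structures are not described in the source.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in metres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def poi_count_within(center, pois, radius_m=500.0):
    """Count POIs within radius_m of center; all points are (lat, lon) tuples."""
    return sum(
        1 for p in pois
        if haversine_m(center[0], center[1], p[0], p[1]) <= radius_m
    )
```

Dividing the count by the circle's area would give a density; a linear scan like this is fine for a sketch, though a spatial index would be the usual choice at scale.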

Proof-of-Concept Results (NCR Delhi)

The model was trained and evaluated in NCR Delhi using 13 field-verified ULAB sites as positive labels. Key observations:

All 13 verified ULAB sites fall within or near the model's high-probability zones
188 grid cells (out of 3,746 scanned) were flagged as elevated risk
High-risk clusters align with known informal industrial corridors
The model generalizes from limited ground truth by leveraging contextual feature patterns

Formal precision/recall metrics require additional field verification of model predictions and are pending future validation campaigns.
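For orientation, the flagged share of the scan and the pending precision/recall calculation can be sketched as follows. The function names and the `(predicted, verified)` pair layout are assumptions for illustration; no field-verification outcomes are available yet, so only the flagged share uses figures from this document.

```python
def flagged_fraction(flagged, scanned):
    """Share of scanned grid cells that were flagged as elevated risk."""
    return flagged / scanned


def precision_recall(outcomes):
    """Compute (precision, recall) from field-verification outcomes.

    outcomes: iterable of (predicted_positive, verified_positive) booleans,
    one pair per visited cell. Returns None for a metric whose denominator
    is zero.
    """
    tp = fp = fn = 0
    for predicted, verified in outcomes:
        if predicted and verified:
            tp += 1
        elif predicted:
            fp += 1
        elif verified:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


# Figures from the NCR Delhi proof of concept: about 5% of cells flagged.
share = flagged_fraction(188, 3746)
```

Recall can only be estimated if field teams also visit some cells the model did not flag, which is one reason the formal metrics depend on a dedicated validation campaign.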

Satellite Imagery Model: Smelter Detection

Overview

The Satellite Model detects industrial smelter facilities using a convolutional neural network (CNN) trained on Sentinel-2 multispectral satellite imagery. Smelters have distinctive spectral, spatial, and contextual signatures visible in multispectral imagery that can help identify facilities even when not registered in official records.

Algorithm

A deep learning classifier is trained on image patches extracted from Sentinel-2 multispectral satellite imagery. The model learns to distinguish smelter facilities from other land cover types based on spatial and spectral patterns.

        Model: Deep learning CNN

        Input: Sentinel-2 multispectral imagery (13 bands, 10–60m resolution)

        Key bands: Visible (B2–B4), NIR (B8), SWIR (B11–B12)

        Training data: Geolocated smelter records from USGS and ILZSG databases

        Output: Risk probability 0.0–1.0 per image patch
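The patch extraction step might look roughly like the sketch below, which cuts a square window around a pixel from a single band stored as a 2D list. The function name, the odd patch size, and the decision to skip windows that cross the image edge are assumptions; the source does not state the patch size or how edges are handled.

```python
def extract_patch(band, row, col, size):
    """Return a size x size window centred on (row, col), or None near the edge.

    band: 2D list of pixel values (a list of rows) for one spectral band.
    size: assumed odd so the window is symmetric around the centre pixel.
    """
    half = size // 2
    top, left = row - half, col - half
    if top < 0 or left < 0:
        return None
    if top + size > len(band) or left + size > len(band[0]):
        return None
    return [r[left:left + size] for r in band[top:top + size]]
```

In practice the same window would be cut from every band used by the model and stacked, after resampling the 20 m and 60 m bands to a common grid.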

Feature Space

The CNN learns its features directly from raw imagery rather than from hand-engineered inputs. The signals it is expected to pick up include:

◆ SWIR Reflectance Anomalies

SWIR bands (B11, B12) capture reflectance differences associated with bare soil, impervious surfaces, and industrial materials common at smelter sites. Note: Sentinel-2 does not carry thermal infrared sensors; heat detection is not directly possible.

◆ Building Morphology

Smelters have distinctive spatial patterns visible at 10 m resolution: large roof areas, industrial building layouts, and associated infrastructure (access roads, storage areas). The CNN learns these structural signatures from visible and NIR bands.

◆ Land Cover Context

Surrounding land cover patterns, such as bare soil, reduced vegetation (low NDVI), and impervious surfaces, provide contextual signals that distinguish smelter sites from other industrial or urban areas.

◆ Temporal Patterns

Multi-temporal composites from repeated Sentinel-2 acquisitions may reveal persistent industrial activity patterns, such as consistent bare ground and sustained low vegetation indices around active facilities.
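The NDVI referred to under Land Cover Context is the standard normalized difference of the near-infrared and red bands, which for Sentinel-2 are B8 and B4. The sketch below is only illustrative: the CNN consumes raw bands, and nothing in the source says NDVI is computed explicitly. The function names are hypothetical.

```python
def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red); None when both reflectances are zero."""
    total = nir + red
    if total == 0:
        return None
    return (nir - red) / total


def ndvi_grid(b8, b4):
    """Apply ndvi() pixel by pixel to two equally sized 2D lists (B8 and B4)."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(b8, b4)
    ]
```

Values near 1 indicate dense vegetation, while values near or below 0 are typical of bare ground, built surfaces, and water.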

Training Data

The model is trained on a curated dataset combining:

Geolocated smelter records from USGS industrial facility databases
ILZSG (International Lead and Zinc Study Group) smelter registries
Field-verified toxic sites in target regions
Sentinel-2 imagery archive covering South Asia
Negative samples: non-smelter industrial facilities and urban areas for model discrimination

Model Status

The satellite model is currently in development. Formal performance metrics (precision, recall, F1, AUC-ROC) will be reported after sufficient field verification of model predictions has been completed. Initial qualitative assessments are encouraging, with the model correctly identifying known smelter locations in validation regions.

Rigorous quantitative evaluation requires a large enough set of field-verified positive and negative predictions, which is an ongoing effort coordinated with field validation partners.

Active Learning Loop (Planned)

A core design goal of the screening tool is to support a feedback loop between model predictions and field verification. As field teams visit predicted sites and report findings, this data can be incorporated to improve future model performance. The envisioned process is outlined below.

Model Predictions → Field Teams → Ground Truth Data → Model Retraining

Envisioned Process

Prediction: Models generate risk scores for grid cells or image patches based on current data
Prioritization: High-confidence detections are sorted by risk score and clustered geographically for field team planning
Field Verification: Partner field teams visit high-priority sites and collect structured field data
Data Integration: Field observations (site type, activity level, environmental conditions) are recorded
Model Evaluation: Predictions are compared against field results to identify false positives and false negatives
Retraining: Models are retrained on accumulated ground truth data as the labeled dataset grows

This feedback loop is aspirational and will be implemented as field verification campaigns progress. The NCR Delhi proof of concept represents the first iteration of this cycle.
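The prioritization step could be sketched as below: keep detections above a score threshold, group them into coarse latitude/longitude tiles, and order both the tiles and their members by risk score. The threshold of 0.7, the 0.05° tile size, the dictionary keys, and the use of fixed tiles in place of a clustering algorithm are all assumptions for illustration, not the tool's actual prioritization engine.

```python
import math
from collections import defaultdict


def prioritize(detections, min_score=0.7, tile_deg=0.05):
    """Group detections scoring >= min_score into tiles, highest risk first.

    detections: list of dicts with 'id', 'lat', 'lon', and 'score' keys.
    Returns a list of (tile, members) pairs, where members are sorted by
    descending score and tiles are ordered by their top-scoring member.
    """
    tiles = defaultdict(list)
    for d in detections:
        if d["score"] >= min_score:
            tile = (math.floor(d["lat"] / tile_deg),
                    math.floor(d["lon"] / tile_deg))
            tiles[tile].append(d)
    for members in tiles.values():
        members.sort(key=lambda d: d["score"], reverse=True)
    return sorted(tiles.items(),
                  key=lambda item: item[1][0]["score"],
                  reverse=True)
```

Grouping by tile lets a field team cover several nearby sites in one visit; a density-based clustering method would be a natural refinement once more detections are available.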

Limitations & Assumptions

Known Limitations

Cloud cover: Satellite-based detection is limited by cloud cover in monsoon regions. Temporal compositing helps but cannot eliminate seasonal gaps.
Small facilities: Sites with rooftops smaller than 100 m² (about one pixel at Sentinel-2's 10 m resolution) may be difficult to detect reliably. The model is optimized for medium to large operations.
Regional variation: Model performance varies by region due to differences in urban density, building materials, and climate. Contextualization by region is recommended.
Data freshness: Satellite imagery lags by 5-12 days. POI data from OpenStreetMap may be outdated in rapidly changing informal settlements.
Informal sector dynamics: ULAB sites frequently relocate. A negative prediction does not guarantee absence of activity; follow-up is important.
False positives: Some industrial facilities (e.g., battery manufacturers) may be mistaken for informal ULAB sites. Field teams must differentiate.
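The temporal compositing mentioned under cloud cover can be illustrated with a per-pixel median over several acquisition dates that ignores cloud-masked values. This is a minimal sketch under assumed conventions (a hypothetical `median_composite` function, `None` marking a cloudy pixel, flat pixel lists); the source does not describe the compositing method actually used.

```python
from statistics import median


def median_composite(observations):
    """Per-pixel median over time, ignoring cloud-masked (None) values.

    observations: list of equally long pixel lists, one per acquisition
    date, where None marks a cloudy pixel. A pixel that is cloudy on every
    date stays None in the composite.
    """
    composite = []
    for values in zip(*observations):
        clear = [v for v in values if v is not None]
        composite.append(median(clear) if clear else None)
    return composite
```

The `None` that survives for always-cloudy pixels corresponds to the seasonal gaps that compositing cannot eliminate.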

Methodological Assumptions

ULAB sites concentrate in areas with high informal business density
Smelter operations produce detectable spatial and spectral signatures in multispectral imagery
POI data from OpenStreetMap and Google Places has reasonable coverage in South Asia
Lead contamination patterns correlate with site locations
Field verification teams can reliably classify site types and activity levels

Data Sources & Attribution

Geospatial Data

OpenStreetMap (OSM): POI data, building footprints, road networks, administrative boundaries
Google Places API: Business listings, shop locations, amenity data
WorldPop: Population density grids at 100m resolution
ESA Copernicus: Sentinel-2 multispectral satellite imagery
USGS (U.S. Geological Survey): Industrial facility database, historical smelter records

Validation Data

Field-verified sites: Toxic sites confirmed through on-the-ground assessment (13 confirmed ULAB sites used in NCR Delhi proof of concept)
ILZSG: International Lead and Zinc Study Group smelter registry
USGS: Industrial facility records with geolocated smelter data
Academic literature: Published studies on lead contamination sources and health impacts

Data Citation: Users of this model are encouraged to cite the underlying data sources. Raw output predictions should not be used as the sole basis for remediation decisions without field verification.

References & Further Reading

Key References

Chen & Guestrin (2016). "XGBoost: A Scalable Tree Boosting System." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
He, Zhang, Ren & Sun (2016). "Deep Residual Learning for Image Recognition." IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Landrigan et al. (2018). "The Lancet Commission on pollution and health." The Lancet, 391(10119), 462–512.
UNICEF (2020). "The Toxic Truth: Children's Exposure to Lead Pollution Undermines a Generation of Future Potential."
Ericson et al. (2021). "A meta-analysis of blood lead levels in India and the attributable burden of disease." Environment International, 147, 106308.
