Technical Methodology

The Lead Detection Tool combines two complementary machine learning models to detect informal lead contamination sources across South Asia.

Two Complementary Pipelines

Our screening approach integrates contextual analysis with remote sensing to maximize geographic coverage while maintaining detection accuracy. The two models operate in parallel and can be deployed independently based on data availability and regional priorities.

Contextual pipeline: Geospatial Data → Contextual Model (XGBoost) → ULAB Detection
Satellite pipeline: Sentinel-2 Imagery → Satellite Model (Deep Learning) → Smelter Detection

Both pipelines output a risk score (0.0–1.0) and confidence metrics that feed into the final prioritization engine.

Contextual Model: ULAB Detection

Overview

The Contextual Model identifies informal used lead-acid battery (ULAB) recycling operations using an XGBoost classifier trained on geospatial, socioeconomic, and environmental features. ULAB sites tend to cluster near battery shops, scrap dealers, and informal settlements, and typically operate without environmental controls.

Algorithm

XGBoost (eXtreme Gradient Boosting) is an ensemble of gradient-boosted decision trees that excels at capturing non-linear relationships in structured data. The model is trained to output a probability score for each 1 km² grid cell, indicating the likelihood of ULAB activity in that cell.
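As a rough illustration of how a boosted tree ensemble turns features into a probability, the formulation below follows Chen & Guestrin (2016). It assumes a binary logistic objective, which this page does not confirm. Here x_i is the feature vector of grid cell i, f_k is the k-th tree, K is the number of trees, T is the number of leaves in a tree, and w is its vector of leaf weights.

```latex
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad
p_i = \sigma(\hat{y}_i) = \frac{1}{1 + e^{-\hat{y}_i}}

\mathcal{L} = \sum_{i} \ell(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad
\ell(y_i, \hat{y}_i) = -\bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\bigr], \qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2
```

Under this assumption, p_i is the per-cell risk probability, and the regularization term penalizes trees with many leaves or large leaf weights.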

Model: XGBoost (gradient boosted trees)
Input: Geospatial features per 1km² grid cell (POI density, population, road networks, land use)
Output: Risk probability 0.0–1.0 per cell
Proof of concept: NCR Delhi (13 verified ULAB sites as training labels, 3,746 grid cells scanned)
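
The per-cell input described above can be illustrated with a minimal sketch that assigns points of interest (POIs) to roughly 1 km² cells and counts them by category. This is not the project's actual feature pipeline: the function names, the equirectangular approximation, the category labels, and the sample coordinates are all hypothetical.

```python
import math
from collections import Counter

KM_PER_DEG_LAT = 111.32  # approximate; treats the Earth as a sphere


def cell_size_deg(ref_lat, cell_km=1.0):
    """Return (dlat, dlon) in degrees for a roughly cell_km x cell_km cell."""
    dlat = cell_km / KM_PER_DEG_LAT
    dlon = cell_km / (KM_PER_DEG_LAT * math.cos(math.radians(ref_lat)))
    return dlat, dlon


def cell_index(lat, lon, south, west, ref_lat, cell_km=1.0):
    """Map a point to the (row, col) of its cell, counted from the south-west corner."""
    dlat, dlon = cell_size_deg(ref_lat, cell_km)
    return int((lat - south) // dlat), int((lon - west) // dlon)


def poi_counts(pois, south, west, ref_lat, cell_km=1.0):
    """Count POIs per (cell, category); pois is a list of (lat, lon, category)."""
    counts = Counter()
    for lat, lon, category in pois:
        cell = cell_index(lat, lon, south, west, ref_lat, cell_km)
        counts[(cell, category)] += 1
    return counts


# Hypothetical POIs, for illustration only (not real site data).
pois = [
    (28.6001, 77.2001, "battery_shop"),
    (28.6002, 77.2003, "battery_shop"),
    (28.6300, 77.2500, "scrap_dealer"),
]
counts = poi_counts(pois, south=28.60, west=77.20, ref_lat=28.6)
```

Counts like these would form one column per category in the feature table. Population, road network, and land use features would be joined to the same cell index.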

Feature Engineering

The model incorporates the following feature categories:

Proof-of-Concept Results (NCR Delhi)

The model was trained and evaluated in NCR Delhi using 13 verified ULAB sites from field-verified databases as positive labels. Key observations:

Formal precision/recall metrics require additional field verification of model predictions and are pending future validation campaigns.

Satellite Imagery Model: Smelter Detection

Overview

The Satellite Model detects industrial smelter facilities using a convolutional neural network (CNN) trained on Sentinel-2 multispectral satellite imagery. Smelters have distinctive spectral, spatial, and contextual signatures visible in multispectral imagery that can help identify facilities even when not registered in official records.

Algorithm

A deep learning classifier is trained on image patches extracted from Sentinel-2 multispectral satellite imagery. The model learns to distinguish smelter facilities from other land cover types based on spatial and spectral patterns.

Model: Deep learning CNN
Input: Sentinel-2 multispectral imagery (13 bands, 10–60m resolution)
Key bands: Visible (B2–B4), NIR (B8), SWIR (B11–B12)
Training data: Geolocated smelter records from USGS and ILZSG databases
Output: Risk probability 0.0–1.0 per image patch
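
To illustrate the "image patch" unit used above, here is a minimal sketch of cutting a multi-band raster into fixed-size patches. It assumes all bands have already been resampled to a common grid, since Sentinel-2 bands differ in native resolution. The function name, patch size, and stride are hypothetical and are not the project's settings.

```python
def extract_patches(band_stack, patch_size, stride):
    """Yield (row, col, patch) tuples from a multi-band raster.

    band_stack: list of bands, each a 2D list (rows x cols) of equal shape.
    patch: list with one patch_size x patch_size 2D list per band.
    Patches that would run past the raster edge are skipped.
    """
    rows, cols = len(band_stack[0]), len(band_stack[0][0])
    for r in range(0, rows - patch_size + 1, stride):
        for c in range(0, cols - patch_size + 1, stride):
            patch = [
                [row[c:c + patch_size] for row in band[r:r + patch_size]]
                for band in band_stack
            ]
            yield r, c, patch


# Toy 4x4 raster with two bands (synthetic values, not real reflectances).
band_a = [[r * 4 + c for c in range(4)] for r in range(4)]
band_b = [[100 + v for v in row] for row in band_a]
patches = list(extract_patches([band_a, band_b], patch_size=2, stride=2))
```

A stride smaller than the patch size would produce overlapping patches, which is a common way to avoid splitting a facility across patch boundaries.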

Feature Space

The CNN automatically learns the following latent features from raw imagery:

Training Data

The model is trained on a curated dataset combining:

Model Status

The satellite model is currently in development. Formal performance metrics (precision, recall, F1, AUC-ROC) will be reported after sufficient field verification of model predictions has been completed. Initial qualitative assessments are encouraging, with the model correctly identifying known smelter locations in validation regions.

Rigorous quantitative evaluation requires a large enough set of field-verified positive and negative predictions, which is an ongoing effort coordinated with field validation partners.
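
Once field-verified outcomes exist, the metrics named above follow from standard definitions. The sketch below is illustrative only: the function name and the sample labels are hypothetical, and no project results are implied.

```python
def classification_metrics(predicted, verified):
    """Compute precision, recall, and F1 from parallel boolean lists.

    predicted[i]: the model flagged site i as a likely smelter or ULAB site.
    verified[i]:  field teams confirmed activity at site i.
    A metric is None when its denominator is zero.
    """
    pairs = list(zip(predicted, verified))
    tp = sum(1 for p, v in pairs if p and v)
    fp = sum(1 for p, v in pairs if p and not v)
    fn = sum(1 for p, v in pairs if not p and v)

    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    if precision is None or recall is None or (precision + recall) == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical outcomes for five visited sites.
metrics = classification_metrics(
    predicted=[True, True, True, False, False],
    verified=[True, True, False, True, False],
)
```

Recall can only be estimated if field teams also visit some sites the model did not flag, which is why verified negative predictions are needed alongside positives.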

Active Learning Loop (Planned)

A core design goal of the screening tool is to support a feedback loop between model predictions and field verification. As field teams visit predicted sites and report findings, this data can be incorporated to improve future model performance. The envisioned process is outlined below.

Feedback cycle: Model Predictions → Field Teams → Ground Truth Data → Model Retraining

Envisioned Process

  1. Prediction: Models generate risk scores for grid cells or image patches based on current data
  2. Prioritization: High-confidence detections are sorted by risk score and clustered geographically for field team planning
  3. Field Verification: Partner field teams visit high-priority sites and collect structured field data
  4. Data Integration: Field observations (site type, activity level, environmental conditions) are recorded
  5. Model Evaluation: Predictions are compared against field results to identify false positives and false negatives
  6. Retraining: Models are retrained on accumulated ground truth data as the labeled dataset grows
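
The prioritization step above could be sketched as follows. This is one possible approach, not the tool's actual engine: the score threshold, the cluster radius, and the greedy clustering strategy are assumptions chosen for illustration, and the site IDs and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def prioritize(detections, min_score=0.7, radius_km=5.0):
    """Group high-scoring detections into visit clusters.

    detections: list of (site_id, lat, lon, risk_score).
    Sites are processed from highest to lowest score; each joins the first
    cluster whose anchor (its highest-scoring site) lies within radius_km,
    otherwise it starts a new cluster.
    """
    ranked = sorted((d for d in detections if d[3] >= min_score),
                    key=lambda d: d[3], reverse=True)
    clusters = []
    for site_id, lat, lon, score in ranked:
        for cluster in clusters:
            if haversine_km(cluster["anchor"], (lat, lon)) <= radius_km:
                cluster["sites"].append((site_id, score))
                break
        else:
            clusters.append({"anchor": (lat, lon), "sites": [(site_id, score)]})
    return clusters


# Hypothetical detections, for illustration only.
detections = [
    ("a", 28.60, 77.20, 0.95),
    ("b", 28.61, 77.21, 0.80),
    ("c", 28.90, 77.60, 0.85),
    ("d", 28.60, 77.20, 0.40),
]
clusters = prioritize(detections)
```

Because sites are processed in score order, clusters come out ordered by their highest-risk site, which gives field teams a natural visiting sequence.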

This feedback loop is aspirational and will be implemented as field verification campaigns progress. The NCR Delhi proof of concept represents the first iteration of this cycle.

Limitations & Assumptions

Known Limitations

  • Cloud cover: Satellite-based detection is limited by cloud cover in monsoon regions. Temporal compositing helps but cannot eliminate seasonal gaps.
  • Small facilities: Sites with rooftops under 100 m² may be difficult to detect reliably. The model is optimized for medium to large operations.
  • Regional variation: Model performance varies by region due to differences in urban density, building materials, and climate. Region-specific calibration is recommended.
  • Data freshness: Satellite imagery lags by 5–12 days. POI data from OpenStreetMap may be outdated in rapidly changing informal settlements.
  • Informal sector dynamics: ULAB sites frequently relocate. A negative prediction does not guarantee absence of activity; follow-up is important.
  • False positives: Some industrial facilities (e.g., battery manufacturers) may be mistaken for informal ULAB sites. Field teams must differentiate.
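
The temporal compositing mentioned under cloud cover can be illustrated with a per-pixel median over cloud-free observations. The source does not state which compositing method is used, so the median, the function name, and the toy values below are assumptions.

```python
from statistics import median


def median_composite(scenes, cloud_masks):
    """Build a per-pixel median composite for a single band.

    scenes:      list of 2D lists of reflectance, one per acquisition date.
    cloud_masks: list of 2D lists of booleans with the same shape, True = cloudy.
    Pixels with no cloud-free observation are returned as None.
    """
    rows, cols = len(scenes[0]), len(scenes[0][0])
    composite = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            clear = [scene[r][c]
                     for scene, mask in zip(scenes, cloud_masks)
                     if not mask[r][c]]
            out_row.append(median(clear) if clear else None)
        composite.append(out_row)
    return composite


# Toy 1x2 raster over three dates; the second pixel is always cloudy.
scenes = [[[0.1, 0.9]], [[0.2, 0.8]], [[0.3, 0.7]]]
cloud_masks = [[[False, True]], [[False, True]], [[False, True]]]
composite = median_composite(scenes, cloud_masks)
```

The None value for the always-cloudy pixel corresponds to the seasonal gaps noted above: compositing cannot recover a pixel that was never observed cloud-free in the time window.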

Methodological Assumptions

  • ULAB sites concentrate in areas with high informal business density
  • Smelter operations produce detectable spatial and spectral signatures in multispectral imagery
  • POI data from OpenStreetMap and Google Places has reasonable coverage in South Asia
  • Lead contamination patterns correlate with site locations
  • Field verification teams can reliably classify site types and activity levels

Data Sources & Attribution

Geospatial Data

Validation Data

Data Citation: Users of this model are encouraged to cite the underlying data sources. Raw output predictions should not be used as the sole basis for remediation decisions without field verification.

References & Further Reading

Key References

  • Chen & Guestrin (2016). "XGBoost: A Scalable Tree Boosting System." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • He, Zhang, Ren & Sun (2016). "Deep Residual Learning for Image Recognition." IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Landrigan et al. (2018). "The Lancet Commission on pollution and health." The Lancet, 391(10119), 462–512.
  • UNICEF (2020). "The Toxic Truth: Children's Exposure to Lead Pollution Undermines a Generation of Future Potential."
  • Ericson et al. (2021). "A meta-analysis of blood lead levels in India and the attributable burden of disease." Environment International, 147, 106308.