Technical Methodology
The Lead Detection Tool combines two complementary machine learning models to detect informal lead contamination sources across South Asia.
Two Complementary Pipelines
Our screening approach integrates contextual analysis with remote sensing to maximize geographic coverage while maintaining detection accuracy. The two models operate in parallel and can be deployed independently based on data availability and regional priorities.
Geospatial Data → Contextual Model (XGBoost) → ULAB Detection
Sentinel-2 Imagery → Satellite Model (Deep Learning) → Smelter Detection
Both pipelines output a risk score (0.0–1.0) and confidence metrics that feed into the final prioritization engine.
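As an illustration of the shared output contract, the sketch below models one pipeline result as a small validated record. The class name, field names, and the choice of a single confidence value are assumptions made for this example; the tool's actual schema is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One pipeline output for a grid cell or image patch (hypothetical schema)."""

    unit_id: str       # grid-cell or patch identifier (assumed format)
    pipeline: str      # "contextual" or "satellite"
    risk_score: float  # probability-like score in [0.0, 1.0]
    confidence: float  # assumed here to be a single value in [0.0, 1.0]

    def __post_init__(self) -> None:
        if self.pipeline not in ("contextual", "satellite"):
            raise ValueError(f"unknown pipeline: {self.pipeline!r}")
        for name in ("risk_score", "confidence"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")
```

Validating the range at construction time means a downstream prioritization step can compare scores from both pipelines without re-checking them.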
Contextual Model: ULAB Detection
Overview
The Contextual Model identifies informal used lead-acid battery (ULAB) recycling operations using an XGBoost classifier trained on geospatial, socioeconomic, and environmental features. ULAB sites are typically found among clusters of battery shops, scrap dealers, and informal settlements, and operate without environmental controls.
Algorithm
XGBoost (eXtreme Gradient Boosting) is a gradient-boosted decision tree ensemble method that is well suited to capturing non-linear relationships in structured (tabular) data. The model is trained to output, for each 1 km² grid cell, a probability score indicating the likelihood of ULAB activity.
Model: XGBoost (gradient boosted trees)
Input: Geospatial features per 1km² grid cell (POI density, population, road networks, land use)
Output: Risk probability 0.0–1.0 per cell
Proof of concept: NCR Delhi (13 verified ULAB sites as training labels, 3,746 grid cells scanned)
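To make the 1 km² grid concrete, here is a minimal sketch of how a coordinate could be assigned to a grid cell. It assumes a grid anchored at a south-west origin and a local equirectangular approximation; the function name, the origin, and the approximation are illustrative choices, not the tool's documented gridding scheme.

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude


def grid_cell_index(lat, lon, origin_lat, origin_lon, cell_km=1.0):
    """Return (row, col) of the grid cell containing (lat, lon).

    Uses a local equirectangular approximation anchored at the grid origin
    (south-west corner), which is adequate for a city-scale grid.
    """
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(origin_lat))
    north_km = (lat - origin_lat) * KM_PER_DEG_LAT
    east_km = (lon - origin_lon) * km_per_deg_lon
    return math.floor(north_km / cell_km), math.floor(east_km / cell_km)
```

For a region the size of NCR Delhi the distortion of this approximation is small relative to the cell size; a projected coordinate system would be the more careful choice for larger areas.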
Feature Engineering
The model incorporates the following feature categories:
- Business Clustering: Density of battery shops, scrap metal dealers, and informal recycling businesses within a 500 m radius. Derived from OpenStreetMap POI data and the Google Places API.
- Population Demographics: Population density, informal settlement presence, and presence of sensitive institutions such as schools and hospitals. Sources: WorldPop, settlement layers from OpenStreetMap.
- Road Networks & Accessibility: Road density, distance to major roads and highways, proximity to transport corridors. Derived from OpenStreetMap road network data.
- Land Use Classification: Industrial vs. residential vs. mixed-use land classification, proximity to industrial zones. Source: OpenStreetMap land use polygons.
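The business clustering feature can be sketched as a count of points of interest within 500 m of a location. The example below uses the haversine formula for distance; the function names and the plain (lat, lon) tuple format are assumptions for illustration, and a production pipeline would more likely use a spatial index.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points, in metres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def poi_count_within(center, pois, radius_m=500.0):
    """Count POIs within radius_m of center; both use (lat, lon) tuples."""
    return sum(
        1 for lat, lon in pois
        if haversine_m(center[0], center[1], lat, lon) <= radius_m
    )
```

Calling `poi_count_within` once per POI category (battery shops, scrap dealers, and so on) would yield one density feature per category for each grid cell centre.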
Proof-of-Concept Results (NCR Delhi)
The model was trained and evaluated in NCR Delhi using 13 field-verified ULAB sites as positive labels. Key observations:
- All 13 verified ULAB sites fall within or near the model's high-probability zones
- 188 of the 3,746 scanned grid cells (about 5%) were flagged as elevated risk
- High-risk clusters align with known informal industrial corridors
- With limited ground truth, the model relies on contextual feature patterns to generalize beyond the labeled sites
Formal precision/recall metrics require additional field verification of model predictions and are pending future validation campaigns.
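With 13 positive labels among 3,746 cells, the training set is heavily imbalanced. How the proof of concept handled this is not stated; the sketch below shows one common option, weighting the positive class through XGBoost's `scale_pos_weight` parameter. Treating all unlabeled cells as negatives, the helper name, and the `max_depth` value are assumptions for illustration only.

```python
def imbalance_params(n_cells, n_positive):
    """Build an XGBoost-style parameter dict with a positive-class weight.

    Assumes every cell without a verified site is treated as a negative,
    which is a simplification (unlabeled cells may contain unknown sites).
    """
    n_negative = n_cells - n_positive
    return {
        "objective": "binary:logistic",  # outputs a probability per cell
        "eval_metric": "aucpr",          # PR-AUC suits rare positives
        "scale_pos_weight": n_negative / n_positive,
        "max_depth": 3,                  # hypothetical: shallow trees for few labels
    }


params = imbalance_params(n_cells=3746, n_positive=13)
```

The resulting dictionary could be passed to an XGBoost classifier as keyword arguments; the weight here works out to 3,733 / 13, or roughly 287 negatives per positive.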
Satellite Imagery Model: Smelter Detection
Overview
The Satellite Model detects industrial smelter facilities using a convolutional neural network (CNN) trained on Sentinel-2 multispectral satellite imagery. Smelters have distinctive spectral, spatial, and contextual signatures in this imagery, which can help identify facilities even when they are absent from official records.
Algorithm
A deep learning classifier is trained on image patches extracted from Sentinel-2 multispectral satellite imagery. The model learns to distinguish smelter facilities from other land cover types based on spatial and spectral patterns.
Model: Deep learning CNN
Input: Sentinel-2 multispectral imagery (13 bands, 10–60m resolution)
Key bands: Visible (B2–B4), NIR (B8), SWIR (B11–B12)
Training data: Geolocated smelter records from USGS and ILZSG databases
Output: Risk probability 0.0–1.0 per image patch
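A patch-based classifier needs the scene cut into fixed-size windows. The sketch below generates patch corners over an image grid; the patch size, stride, and function name are hypothetical, since the actual patch geometry is not specified. Note also that B11 and B12 are 20 m bands, so stacking them with the 10 m bands would require resampling to a common grid first.

```python
def patch_offsets(height, width, patch=32, stride=32):
    """Yield (row, col) top-left corners of patches that fit fully inside the image."""
    for row in range(0, height - patch + 1, stride):
        for col in range(0, width - patch + 1, stride):
            yield row, col


# Band names taken from the list above; using only these six is an assumption.
KEY_BANDS = ("B2", "B3", "B4", "B8", "B11", "B12")
```

At 10 m resolution a 32-pixel patch covers 320 m on a side; a stride smaller than the patch size would produce overlapping patches, which can help when a facility straddles a patch boundary.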
Feature Space
The CNN is expected to learn latent features of the following kinds from raw imagery:
- SWIR Reflectance Anomalies: SWIR bands (B11, B12) capture reflectance differences associated with bare soil, impervious surfaces, and industrial materials common at smelter sites. Note: Sentinel-2 does not carry thermal infrared sensors; heat detection is not directly possible.
- Building Morphology: Smelters have distinctive spatial patterns visible at 10 m resolution: large roof areas, industrial building layouts, and associated infrastructure (access roads, storage areas). The CNN learns these structural signatures from visible and NIR bands.
- Land Cover Context: Surrounding land cover patterns (bare soil, reduced vegetation with low NDVI, impervious surfaces) provide contextual signals that distinguish smelter sites from other industrial or urban areas.
- Temporal Patterns: Multi-temporal composites from repeated Sentinel-2 acquisitions may reveal persistent industrial activity patterns, such as consistent bare ground and sustained low vegetation indices around active facilities.
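The low-NDVI signal mentioned above follows from the standard definition NDVI = (NIR − Red) / (NIR + Red), which for Sentinel-2 uses B8 and B4. The sketch below computes it per pixel and summarizes a patch; the 0.2 threshold and the function names are illustrative assumptions, and the CNN itself works from raw bands rather than from this hand-computed index.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR (B8) and red (B4) reflectance."""
    total = nir + red
    if total == 0:
        return 0.0  # convention chosen here for no-signal pixels
    return (nir - red) / total


def low_vegetation_fraction(nir_band, red_band, threshold=0.2):
    """Share of pixels whose NDVI falls below threshold (threshold is illustrative)."""
    values = [ndvi(n, r) for n, r in zip(nir_band, red_band)]
    return sum(v < threshold for v in values) / len(values)
```

A hand-computed summary like this can serve as a sanity check on what the network is responding to, for example by comparing it against predicted risk on known sites.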
Training Data
The model is trained on a curated dataset combining:
- Geolocated smelter records from USGS industrial facility databases
- ILZSG (International Lead and Zinc Study Group) smelter registries
- Field-verified toxic sites in target regions
- Sentinel-2 imagery archive covering South Asia
- Negative samples: non-smelter industrial facilities and urban areas for model discrimination
Model Status
The satellite model is currently in development. Formal performance metrics (precision, recall, F1, AUC-ROC) will be reported after sufficient field verification of model predictions has been completed. Initial qualitative assessments are encouraging, with the model correctly identifying known smelter locations in validation regions.
Rigorous quantitative evaluation requires a large enough set of field-verified positive and negative predictions, which is an ongoing effort coordinated with field validation partners.
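Once field-verified outcomes exist, precision, recall, and F1 follow directly from the counts of true positives, false positives, and false negatives. The sketch below assumes predictions and field results arrive as parallel lists of booleans; the function name and input format are hypothetical.

```python
def classification_metrics(predicted, verified):
    """Precision, recall and F1 from parallel lists of booleans."""
    pairs = list(zip(predicted, verified))
    tp = sum(1 for p, v in pairs if p and v)      # flagged and confirmed
    fp = sum(1 for p, v in pairs if p and not v)  # flagged but not confirmed
    fn = sum(1 for p, v in pairs if v and not p)  # missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Recall can only be estimated if some sites the model did not flag are also visited, which is why the verified set needs negative as well as positive predictions.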
Active Learning Loop (Planned)
A core design goal of the screening tool is to support a feedback loop between model predictions and field verification. As field teams visit predicted sites and report findings, this data can be incorporated to improve future model performance. The envisioned process is outlined below.
Model Predictions → Field Teams → Ground Truth Data → Model Retraining
Envisioned Process
- Prediction: Models generate risk scores for grid cells or image patches based on current data
- Prioritization: High-confidence detections are sorted by risk score and clustered geographically for field team planning
- Field Verification: Partner field teams visit high-priority sites and collect structured field data
- Data Integration: Field observations (site type, activity level, environmental conditions) are recorded
- Model Evaluation: Predictions are compared against field results to identify false positives and false negatives
- Retraining: Models are retrained on accumulated ground truth data as the labeled dataset grows
This feedback loop is aspirational and will be implemented as field verification campaigns progress. The NCR Delhi proof of concept represents the first iteration of this cycle.
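Two steps of the envisioned cycle, prioritization and data integration, can be sketched in a few lines. The threshold, the visit limit, and both function names are assumptions for illustration; geographic clustering for field planning is omitted here.

```python
def prioritize(scores, threshold=0.7, limit=10):
    """Return up to `limit` unit ids with score >= threshold, highest risk first."""
    flagged = [(uid, s) for uid, s in scores.items() if s >= threshold]
    flagged.sort(key=lambda item: item[1], reverse=True)
    return [uid for uid, _ in flagged[:limit]]


def integrate(labels, field_reports):
    """Merge field-verified outcomes into the labeled set used for retraining."""
    updated = dict(labels)          # leave the existing label set untouched
    updated.update(field_reports)   # newest field observation wins
    return updated
```

Each pass through the loop would call `prioritize` on fresh scores, send the resulting list to field teams, and feed their reports through `integrate` before the next retraining run.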
Limitations & Assumptions
Known Limitations
- Cloud cover: Satellite-based detection is limited by cloud cover in monsoon regions. Temporal compositing helps but cannot eliminate seasonal gaps.
- Small facilities: Sites with rooftops smaller than 100 m² (about one 10 m Sentinel-2 pixel) may be difficult to detect reliably. The model is optimized for medium to large operations.
- Regional variation: Model performance varies by region due to differences in urban density, building materials, and climate. Region-specific calibration or retraining is recommended.
- Data freshness: Satellite imagery lags by 5-12 days. POI data from OpenStreetMap may be outdated in rapidly changing informal settlements.
- Informal sector dynamics: ULAB sites frequently relocate. A negative prediction does not guarantee absence of activity; follow-up is important.
- False positives: Some industrial facilities (e.g., battery manufacturers) may be mistaken for informal ULAB sites. Field teams must differentiate.
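The temporal compositing mentioned under cloud cover can be illustrated with a per-pixel median over several acquisitions that skips cloud-masked values. The sketch assumes cloudy pixels are already marked as `None` and that each acquisition is a flat list of pixel values; both the data layout and the function name are hypothetical.

```python
from statistics import median


def median_composite(observations):
    """Per-pixel median over time, ignoring cloud-masked values (None).

    observations: list of acquisitions, each a list of pixel values or None.
    Returns None for pixels that were cloudy in every acquisition.
    """
    composite = []
    for pixel_series in zip(*observations):
        clear = [v for v in pixel_series if v is not None]
        composite.append(median(clear) if clear else None)
    return composite
```

A pixel that is cloudy in every acquisition stays `None`, which is the seasonal gap that compositing cannot remove during the monsoon.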
Methodological Assumptions
- ULAB sites concentrate in areas with high informal business density
- Smelter operations produce detectable spatial and spectral signatures in multispectral imagery
- POI data from OpenStreetMap and Google Places has reasonable coverage in South Asia
- Lead contamination is spatially correlated with ULAB and smelter site locations
- Field verification teams can reliably classify site types and activity levels
Data Sources & Attribution
Geospatial Data
- OpenStreetMap (OSM): POI data, building footprints, road networks, administrative boundaries
- Google Places API: Business listings, shop locations, amenity data
- WorldPop: Population density grids at 100m resolution
- ESA Copernicus: Sentinel-2 multispectral satellite imagery
- USGS (U.S. Geological Survey): Industrial facility database, historical smelter records
Validation Data
- Field-verified sites: Toxic sites confirmed through on-the-ground assessment (13 confirmed ULAB sites used in NCR Delhi proof of concept)
- ILZSG: International Lead and Zinc Study Group smelter registry
- USGS: Industrial facility records with geolocated smelter data
- Academic literature: Published studies on lead contamination sources and health impacts
Data Citation: Users of this model are encouraged to cite the underlying data sources. Raw output predictions should not be used as the sole basis for remediation decisions without field verification.
References & Further Reading
Key References
- Chen & Guestrin (2016). "XGBoost: A Scalable Tree Boosting System." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- He, Zhang, Ren & Sun (2016). "Deep Residual Learning for Image Recognition." IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Landrigan et al. (2018). "The Lancet Commission on pollution and health." The Lancet, 391(10119), 462–512.
- UNICEF (2020). "The Toxic Truth: Children's Exposure to Lead Pollution Undermines a Generation of Future Potential."
- Ericson et al. (2021). "A meta-analysis of blood lead levels in India and the attributable burden of disease." Environment International, 147, 106308.