Data Quality Crisis - Geocoding, Duplicates, and Validation
The Null Island Problem
Manual seeding worked for Sacramento's 20 food banks. But to scale, I needed automated discovery - scraping web sources, using AI to extract structured data, and inserting into the database.
The first automated scan ran. It found 500+ food banks. Success!
Then I looked at the map.
Half the markers were at coordinates (0, 0) - a point in the ocean off the coast of Africa, colloquially known as "Null Island."
The geocoding had failed silently.
Phase 1: Critical Fixes
That kicked off a two-week, crisis-mode sprint, documented in docs/archive/phase-1-critical-fixes.md.
Fix 1: Geocoding Failure Handling
The original discovery pipeline:
// BAD: Returns (0,0) on failure
if (lat === 0 && lng === 0) {
  const coords = await geocodeAddress(...);
  if (coords) {
    return { ...res, latitude: coords.latitude, longitude: coords.longitude };
  }
  return res; // ❌ Returns with lat=0, lng=0
}
The fix:
// GOOD: Skips invalid resources
async function processBatchResults(rawResults: TavilySearchResult[]) {
  const results: DiscoveryResult[] = [];
  const failures: FailureLog[] = [];

  for (const result of rawResults) {
    let coords = extractCoords(result);

    // Validate coordinates
    if (!coords || coords.latitude === 0 || coords.longitude === 0) {
      // Try geocoding
      coords = await geocodeAddress(result.address, result.city, result.state);

      if (!coords || coords.latitude === 0 || coords.longitude === 0) {
        // Log and skip
        failures.push({
          resource: result.name,
          reason: "Geocoding failed or returned invalid coordinates",
        });
        continue; // Skip this resource
      }
    }

    results.push({ ...result, ...coords });
  }

  if (failures.length > 0) {
    console.warn(`[Geocoding] Failed: ${failures.length} resources`, failures);
  }

  return results;
}
Key change: Skip resources with invalid coordinates instead of inserting bad data.
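For context, geocodeAddress is just a thin wrapper around a geocoding service that resolves to null on any failure instead of zeroes. Here's a minimal sketch of what such a wrapper looks like, assuming Nominatim as the provider (the real project may use a different geocoder):

// Hypothetical sketch of geocodeAddress, assuming Nominatim as the provider
interface Coords {
  latitude: number;
  longitude: number;
}

export async function geocodeAddress(address: string, city: string, state: string): Promise<Coords | null> {
  const query = encodeURIComponent(`${address}, ${city}, ${state}`);

  try {
    const res = await fetch(
      `https://nominatim.openstreetmap.org/search?q=${query}&format=json&limit=1`,
      // Nominatim's usage policy requires an identifying User-Agent
      { headers: { "User-Agent": "food-bank-finder (contact: admin@example.com)" } }
    );
    if (!res.ok) return null;

    const matches = (await res.json()) as Array<{ lat: string; lon: string }>;
    if (matches.length === 0) return null;

    const latitude = parseFloat(matches[0].lat);
    const longitude = parseFloat(matches[0].lon);

    // Treat NaN and exact zeroes as failures so callers can skip the row
    if (!Number.isFinite(latitude) || !Number.isFinite(longitude)) return null;
    if (latitude === 0 && longitude === 0) return null;

    return { latitude, longitude };
  } catch {
    return null;
  }
}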
Fix 2: Enhancement API Schema Error
The admin enhancement API used generateText() and parsed the model's raw output by hand, which broke whenever OpenRouter returned anything other than clean JSON:
// BEFORE: Schema error
const { text } = await generateText({
  model: openrouter(model),
  prompt: `Enhance this food bank...`,
});

const parsed = JSON.parse(text); // Brittle
The fix: Use generateObject() with Zod:
// AFTER: Type-safe structured output
const { object } = await generateObject({
  model: openrouter(model),
  schema: z.object({
    updates: z.object({
      phone: z.string().nullable().optional(),
      website: z.string().url().nullable().optional(),
      description: z.string().nullable().optional(),
      services: z.array(z.string()).nullable().optional(),
      hours: z
        .record(
          z.string(),
          z
            .object({
              open: z.string(),
              close: z.string(),
              closed: z.boolean().optional(),
            })
            .nullable()
        )
        .nullable()
        .optional(),
    }),
    summary: z.string(),
    confidence: z.number().min(0).max(1),
    sources: z.array(z.string().url()),
  }),
  prompt: `Enhance this food bank data...`,
  temperature: 0.3,
});

return {
  proposed: object.updates,
  summary: object.summary,
  confidence: object.confidence,
  sources: object.sources,
};
This eliminated JSON parsing errors and provided type safety.
Fix 3: Resource Feed Pagination
The feed query had broken filter logic:
// BUG: Always returns undefined, loading everything
const whereClause = options.includeStatuses
  ? undefined // ❌ Ignores includeStatuses
  : notInArray(foodBanks.verificationStatus, excludedStatuses);
The fix:
// Build the filter first, then apply it together with pagination in one query
const whereClause =
  includeStatuses && includeStatuses.length > 0
    ? inArray(foodBanks.verificationStatus, includeStatuses)
    : excludeRejected
      ? notInArray(foodBanks.verificationStatus, excludedStatuses)
      : undefined;

const rows = await db
  .select()
  .from(foodBanks)
  .where(whereClause)
  .limit(limit)
  .offset(offset);
Now filters worked correctly, and pagination was enforced.
Fix 4: Database Indices
With 500+ food banks, queries were slow. Missing indices:
-- Before: Full table scans
SELECT * FROM food_banks WHERE latitude BETWEEN ... AND ...;
-- 2-3 seconds with 500 rows
-- After: Indices added
CREATE INDEX idx_food_banks_coords ON food_banks(latitude, longitude);
CREATE INDEX idx_food_banks_status ON food_banks(verification_status);
CREATE INDEX idx_food_banks_created ON food_banks(created_at DESC);
-- Query time: 20-50ms (100x improvement)
Simple but crucial for performance.
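If the schema lives in Drizzle, the same indices can be declared next to the table definition so migrations create them automatically. A rough sketch, assuming a Postgres schema via drizzle-orm/pg-core; the column names here are illustrative, not the project's actual schema:

// Hypothetical Drizzle schema excerpt -- column names are illustrative
import { pgTable, text, doublePrecision, timestamp, index } from "drizzle-orm/pg-core";

export const foodBanks = pgTable(
  "food_banks",
  {
    id: text("id").primaryKey(),
    name: text("name").notNull(),
    latitude: doublePrecision("latitude").notNull(),
    longitude: doublePrecision("longitude").notNull(),
    verificationStatus: text("verification_status").notNull().default("pending"),
    createdAt: timestamp("created_at").defaultNow(),
  },
  (table) => ({
    coordsIdx: index("idx_food_banks_coords").on(table.latitude, table.longitude),
    statusIdx: index("idx_food_banks_status").on(table.verificationStatus),
    // The DESC ordering from the raw SQL is omitted here for simplicity
    createdIdx: index("idx_food_banks_created").on(table.createdAt),
  })
);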
Phase 2: Data Integrity
With critical bugs fixed, I tackled data quality.
Confidence Scoring
Not all data sources are equal. I built a 0-100 confidence scoring system:
// src/lib/admin-queries.ts
export function calculateConfidenceScore(resource: {
  name: string;
  address: string | null;
  phone: string | null;
  website: string | null;
  hours: any;
  description: string | null;
  services: string[] | null;
  latitude: number;
  longitude: number;
}) {
  let score = 0;

  // Base fields (40 points)
  if (resource.name) score += 10;
  if (resource.address) score += 10;
  if (resource.latitude !== 0 && resource.longitude !== 0) score += 20;

  // Contact info (20 points)
  if (resource.phone) score += 10;
  if (resource.website) score += 10;

  // Rich data (40 points)
  if (resource.hours) score += 15;
  if (resource.description) score += 10;
  if (resource.services && resource.services.length > 0) score += 15;

  return Math.min(score, 100);
}
This let admins prioritize low-confidence resources for manual review.
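The score is cheap to compute, so the discovery pipeline can calculate it once at insert time and store it on the row. A sketch of that insert path; the confidenceScore column and the surrounding variables are assumptions:

// Hypothetical insert path: compute the score once and persist it with the resource
const confidenceScore = calculateConfidenceScore(resource);

await db.insert(foodBanks).values({
  ...resource,
  confidenceScore,
  verificationStatus: "pending", // everything from automated discovery starts unverified
});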
Duplicate Detection
Automated discovery often found the same food bank multiple times:
// src/lib/duplicate-guard.ts
export async function detectDuplicates(candidate: {
  name: string;
  address: string;
  city: string;
  latitude: number;
  longitude: number;
}) {
  const existing = await db.select().from(foodBanks);

  const duplicates = existing
    .map((resource) => {
      // Exact address match (hard duplicate)
      if (
        resource.address?.toLowerCase() === candidate.address.toLowerCase() &&
        resource.city?.toLowerCase() === candidate.city.toLowerCase()
      ) {
        return { type: "hard", confidence: 1.0, resource };
      }

      // Name similarity + distance (soft duplicate)
      // Normalized similarity: 1 = identical names, 0 = completely different
      const nameSimilarity =
        1 -
        levenshtein(resource.name.toLowerCase(), candidate.name.toLowerCase()) /
          Math.max(resource.name.length, candidate.name.length);
      const distance = calculateDistance(
        resource.latitude,
        resource.longitude,
        candidate.latitude,
        candidate.longitude
      );

      if (nameSimilarity > 0.8 && distance < 0.1) {
        return { type: "soft", confidence: nameSimilarity, resource };
      }

      return null;
    })
    .filter(Boolean);

  return duplicates;
}
Hard duplicates (exact address) were rejected. Soft duplicates (fuzzy match) were flagged for admin review.
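calculateDistance above is a plain haversine helper; the < 0.1 threshold reads as roughly 100 meters, so I'm assuming kilometres as the unit. A minimal sketch:

// Haversine great-circle distance in kilometres (the unit assumed for the 0.1 threshold)
export function calculateDistance(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const toRad = (deg: number) => (deg * Math.PI) / 180;
  const R = 6371; // mean Earth radius in km

  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;

  return 2 * R * Math.asin(Math.sqrt(a));
}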
Phone & Website Validation
LLMs are good at extracting data, but the values they return are often malformed:
// src/lib/validators.ts
import { parsePhoneNumber } from 'libphonenumber-js';

export function validatePhone(phone: string): boolean {
  try {
    const parsed = parsePhoneNumber(phone, 'US');
    return parsed.isValid();
  } catch {
    return false;
  }
}

export function validateWebsite(url: string): boolean {
  try {
    new URL(url);
    return url.startsWith('http://') || url.startsWith('https://');
  } catch {
    return false;
  }
}
Invalid phone numbers and URLs were rejected during insertion.
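One way to apply the validators at insert time is to null out any field that fails validation, so a single bad phone number doesn't discard an otherwise good resource. A sketch of that step; the helper name and import path are mine, not from the repo:

// Hypothetical sanitization step run before insertion
import { validatePhone, validateWebsite } from "@/lib/validators";

export function sanitizeContactFields(resource: {
  phone?: string | null;
  website?: string | null;
  [key: string]: unknown;
}) {
  return {
    ...resource,
    // Null out malformed values rather than rejecting the whole resource
    phone: resource.phone && validatePhone(resource.phone) ? resource.phone : null,
    website: resource.website && validateWebsite(resource.website) ? resource.website : null,
  };
}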
Data Versioning
To track changes, I added audit logging:
CREATE TABLE data_versions (
  id TEXT PRIMARY KEY,
  resource_id TEXT NOT NULL REFERENCES food_banks(id),
  field_name TEXT NOT NULL,
  old_value TEXT,
  new_value TEXT,
  changed_by TEXT REFERENCES user(id),
  change_reason TEXT,
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_versions_resource ON data_versions(resource_id, created_at DESC);
Now every edit was traceable - critical for community-sourced data.
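Writing to that table amounts to diffing the old and new rows whenever an admin edit or AI enhancement is applied. A rough sketch, assuming a Drizzle dataVersions table mapped to the SQL above; the helper and its shape are mine:

// Hypothetical helper: write one data_versions row per changed field
// (db and dataVersions are assumed to be the app's Drizzle client and the table above)
import { randomUUID } from "crypto";

export async function recordVersions(
  resourceId: string,
  oldValues: Record<string, unknown>,
  newValues: Record<string, unknown>,
  changedBy: string | null,
  changeReason: string
) {
  const rows = Object.keys(newValues)
    .filter((field) => String(oldValues[field] ?? "") !== String(newValues[field] ?? ""))
    .map((field) => ({
      id: randomUUID(),
      resourceId,
      fieldName: field,
      oldValue: oldValues[field] == null ? null : String(oldValues[field]),
      newValue: newValues[field] == null ? null : String(newValues[field]),
      changedBy,
      changeReason,
    }));

  if (rows.length > 0) {
    await db.insert(dataVersions).values(rows);
  }
}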
The Admin Dashboard
Managing 500+ resources required tooling:
// src/app/admin/page.tsx
export default async function AdminDashboard() {
  const [stats] = await db
    .select({
      total: count(),
      verified: countIf(eq(foodBanks.verificationStatus, 'official')),
      pending: countIf(eq(foodBanks.verificationStatus, 'pending')),
      rejected: countIf(eq(foodBanks.verificationStatus, 'rejected')),
      missingPhone: countIf(isNull(foodBanks.phone)),
      missingWebsite: countIf(isNull(foodBanks.website)),
      missingHours: countIf(isNull(foodBanks.hours)),
      lowConfidence: countIf(lt(foodBanks.confidenceScore, 50)),
    })
    .from(foodBanks);

  return (
    <div className="space-y-6">
      <h1 className="text-3xl font-bold">Admin Dashboard</h1>

      <div className="grid gap-4 md:grid-cols-4">
        <StatsCard label="Total Resources" value={stats.total} />
        <StatsCard label="Verified" value={stats.verified} />
        <StatsCard label="Pending Review" value={stats.pending} />
        <StatsCard label="Low Confidence" value={stats.lowConfidence} />
      </div>

      <ResourceQueue />
    </div>
  );
}
The queue showed resources needing attention, sorted by confidence score.
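A sketch of the query a component like ResourceQueue can run, assuming the stored confidenceScore column from earlier; the exact status filter is a guess:

// Hypothetical ResourceQueue query: pending resources, least trustworthy first
import { asc, eq } from "drizzle-orm";

const queue = await db
  .select()
  .from(foodBanks)
  .where(eq(foodBanks.verificationStatus, "pending"))
  .orderBy(asc(foodBanks.confidenceScore))
  .limit(50);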
What Went Right
- Skip Invalid Data: Better to have fewer, accurate resources than many bad ones
- Confidence Scoring: Prioritizing low-confidence resources saved hours
- Duplicate Detection: Prevented database bloat
- Validation: libphonenumber-js caught malformed data
- Audit Trail: Data versioning enabled accountability
What I'd Do Differently
Mistake 1: No Retry Logic
Geocoding failures were sometimes transient (API timeouts). Retrying with exponential backoff would have recovered more resources.
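A sketch of the kind of wrapper that would have helped, reusing the existing geocodeAddress; the attempt count and delays are arbitrary:

// Hypothetical retry wrapper with exponential backoff for transient geocoding failures
async function geocodeWithRetry(
  address: string,
  city: string,
  state: string,
  maxAttempts = 3
): Promise<{ latitude: number; longitude: number } | null> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const coords = await geocodeAddress(address, city, state);
      if (coords) return coords;
    } catch {
      // Treat thrown errors (timeouts, rate limits) the same as a null result
    }
    if (attempt < maxAttempts - 1) {
      // Back off 500ms, 1s, 2s, ... before the next attempt
      await new Promise((resolve) => setTimeout(resolve, 500 * 2 ** attempt));
    }
  }
  return null;
}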
Mistake 2: No Caching
Geocoding the same city 100 times was wasteful. A simple cache would have saved API calls and time.
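A sketch of the cache that would have helped, keyed on the normalized address. It's in-memory here; a database table or KV store would survive restarts:

// Hypothetical in-memory geocoding cache keyed on the normalized address
const geocodeCache = new Map<string, { latitude: number; longitude: number } | null>();

async function geocodeCached(address: string, city: string, state: string) {
  const key = `${address}|${city}|${state}`.toLowerCase().trim();

  if (geocodeCache.has(key)) {
    return geocodeCache.get(key) ?? null;
  }

  const coords = await geocodeAddress(address, city, state);
  // Cache failures too, so repeatedly bad addresses don't burn API calls
  geocodeCache.set(key, coords);
  return coords;
}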
Mistake 3: Late Indexing
Adding indices after performance problems was painful. Should have added them upfront based on expected query patterns.
What I Learned
- Data Quality > Quantity: 100 accurate resources beat 1000 questionable ones
- Geocoding Is Hard: LLMs are terrible at coordinates; always validate
- Indices Matter: 100x speedup from a single CREATE INDEX
- Duplicate Detection Is Non-Trivial: Fuzzy matching + distance heuristics work but aren't perfect
- Audit Trails Are Essential: Community-sourced data needs accountability
Up Next
In Part 9, I'll cover trust and verification - building user confidence with verification badges, source attribution, and transparency features.
Key Commits: Phase 1 & 2 completion (multiple commits)
Related Files:
- docs/archive/phase-1-critical-fixes.md - Phase 1 plan
- docs/archive/phase-2-data-integrity.md - Phase 2 plan
- src/lib/duplicate-guard.ts - Duplicate detection
- src/lib/validators.ts - Phone/website validation