DataHub, Iceberg, and Metadata Bridge Jobs on a Homelab Cluster
How DataHub integration in rpi_kubernetes evolved with ingestion cronjobs, bridge configmaps, and Iceberg-oriented governance patterns.
Why Add DataHub Here
Object storage and pipelines are useful, but without metadata you eventually lose discoverability.
The rpi_kubernetes updates added a meaningful DataHub footprint under kubernetes/base-services/datahub/, including:
- ingestion cronjobs
- ingestion recipe configmaps
- metadata bridge config
- values files for prerequisites and DataHub itself
What Changed Operationally
The key progression was from static deployment to recurring ingestion behavior.
Files like:
- cronjob-ingest-postgres.yaml
- cronjob-ingest-minio-s3.yaml
- cronjob-ingest-mlflow.yaml
- cronjob-metadata-bridge.yaml
show that metadata refresh is now treated as an ongoing service, not a one-time setup task.
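To make the recurring-ingestion idea concrete, here is a minimal sketch of what one of these CronJobs might look like, generated from Python so the same template can be reused per source. It is not the content of the repository's manifests: the image placeholder, the namespace, the ConfigMap names (`recipe-<source>`), the `recipe.yml` key, and the schedules are all assumptions made for illustration. The only things taken as given are the standard `batch/v1` CronJob fields and the `datahub ingest -c <recipe>` CLI invocation.

```python
import json

# Hypothetical values; none of these come from the repository.
INGESTION_IMAGE = "REPLACE_ME/datahub-ingestion:TAG"  # placeholder image reference
NAMESPACE = "datahub"  # assumed namespace


def build_ingestion_cronjob(source: str, schedule: str) -> dict:
    """Return a batch/v1 CronJob manifest that runs one DataHub recipe."""
    recipe_configmap = f"recipe-{source}"  # hypothetical ConfigMap name
    container = {
        "name": "ingest",
        "image": INGESTION_IMAGE,
        # Assumes the ConfigMap exposes the recipe under the key "recipe.yml".
        "command": ["datahub", "ingest", "-c", "/recipes/recipe.yml"],
        "volumeMounts": [{"name": "recipe", "mountPath": "/recipes"}],
    }
    pod_spec = {
        "restartPolicy": "Never",
        "containers": [container],
        "volumes": [{"name": "recipe", "configMap": {"name": recipe_configmap}}],
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": f"ingest-{source}", "namespace": NAMESPACE},
        "spec": {
            "schedule": schedule,
            # Skip a run if the previous one is still going.
            "concurrencyPolicy": "Forbid",
            "successfulJobsHistoryLimit": 3,
            "failedJobsHistoryLimit": 3,
            "jobTemplate": {
                "spec": {"backoffLimit": 2, "template": {"spec": pod_spec}}
            },
        },
    }


if __name__ == "__main__":
    # Staggered hourly schedules (assumed) so the jobs do not all start at once.
    schedules = {
        "postgres": "0 * * * *",
        "minio-s3": "15 * * * *",
        "mlflow": "30 * * * *",
    }
    items = [build_ingestion_cronjob(s, c) for s, c in schedules.items()]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed list could be piped into `kubectl apply -f -` once the placeholders are filled in. Setting `concurrencyPolicy` to `Forbid` is a design choice that suits small nodes: a slow ingestion run delays the next one instead of stacking pods.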
Why This Is A Good Pattern
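On small clusters, operational simplicity matters. Cron-based ingestion gives predictable behavior and easy debugging: each run is a discrete Kubernetes Job with its own pod logs and exit status, so a failed ingestion is simple to find and rerun. It still enables a governance layer across storage and ML systems.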
It also aligns with the platform direction toward DataHub + Iceberg catalog patterns, making later lineage and discovery workflows easier.
What I Would Harden Next
For production-grade behavior, I would add:
- explicit retry and dead-letter handling for ingestion jobs
- richer run observability tied to Prometheus/Grafana
- metadata freshness SLOs per source
The existing manifest structure makes these additions straightforward.
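As a rough illustration of the first and third items, the sketch below shows one way retry with a dead-letter record and a per-source freshness check could be expressed. Nothing here exists in the repository: `run_with_retry`, `stale_sources`, the in-memory dead-letter list, and the SLO values are all hypothetical names and numbers chosen for the example. In a real setup the dead-letter record would more likely go to a durable store, and the freshness result would be exported as a metric.

```python
import time
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List, Optional


def run_with_retry(job: Callable[[], None], attempts: int, base_delay_s: float,
                   dead_letter: List[str], name: str,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run `job` up to `attempts` times with exponential backoff.

    If every attempt fails, append the job name and last error to
    `dead_letter` (a stand-in for a durable dead-letter store).
    """
    last_error = ""
    for attempt in range(attempts):
        try:
            job()
            return True
        except Exception as exc:  # broad on purpose: any failure counts
            last_error = repr(exc)
            if attempt < attempts - 1:
                sleep(base_delay_s * (2 ** attempt))
    dead_letter.append(f"{name}: {last_error}")
    return False


def stale_sources(last_success: Dict[str, datetime],
                  slo: Dict[str, timedelta],
                  now: Optional[datetime] = None) -> List[str]:
    """Return sources whose last successful run is older than their SLO.

    A source with no recorded success is treated as stale.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        src for src, max_age in slo.items()
        if src not in last_success or now - last_success[src] > max_age
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical SLOs and last-success timestamps.
    slo = {"postgres": timedelta(hours=2), "mlflow": timedelta(hours=24)}
    last = {"postgres": now - timedelta(hours=3), "mlflow": now - timedelta(hours=1)}
    print(stale_sources(last, slo, now))
```

Keeping the SLO as a per-source mapping reflects the idea that a transactional database and an experiment tracker rarely need the same refresh cadence.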
Practical Takeaway
If you already run MinIO, Postgres, and MLflow in-cluster, DataHub ingestion cronjobs are one of the highest-leverage additions for long-term maintainability.