March 13, 2026 · 6 min read · mlai.qa Team

The ML Architecture Review: 20 Things We Check

The complete checklist we use in our ML architecture reviews — training infrastructure, data pipelines, model serving, monitoring, and deployment process.

An ML architecture review is different from a software architecture review. Software architecture reviews focus on scalability, reliability, and maintainability of code and infrastructure. ML architecture reviews cover all of that — plus the ML-specific concerns that software reviews miss: training-serving consistency, model versioning, retraining infrastructure, and the monitoring systems that tell you whether your model is working correctly in production.

When we conduct an ML architecture review, we work through 20 checkpoints across 5 categories. This checklist reflects what we look for — and what Series B investors, enterprise customers, and ML engineering leads look for when evaluating an AI system’s production readiness.

Category 1: Training Infrastructure (4 Checks)

1. Model Registry

Is every trained model registered with its evaluation metrics, training data version, hyperparameters, and the artifact required for reproducible inference? A model registry is the audit trail that makes everything else possible: rollbacks, A/B testing, lineage tracking, and reproducibility.

Red flag: Models deployed by overwriting a file in S3, with no record of what model was serving when.
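A minimal sketch of what a registry entry must capture, assuming a hypothetical `ModelRecord` type (real deployments would use a registry product such as MLflow or SageMaker Model Registry rather than hand-rolled records):

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ModelRecord:
    """Minimal registry entry: everything needed to audit or reproduce a model."""
    name: str
    version: str
    artifact_uri: str   # immutable location of the serialized model
    data_version: str   # e.g. a dataset snapshot ID or content hash
    hyperparameters: dict
    metrics: dict

    def fingerprint(self) -> str:
        """Stable hash over the full record, usable as an audit-trail key."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

# Hypothetical example entry
record = ModelRecord(
    name="churn-classifier",
    version="2026-03-01.1",
    artifact_uri="s3://models/churn/2026-03-01.1/model.pkl",
    data_version="snapshot:4f2a9c",
    hyperparameters={"max_depth": 6, "lr": 0.1},
    metrics={"auc": 0.91},
)
```

The point of the fingerprint is that "what model was serving when" becomes a query over immutable records rather than archaeology over overwritten S3 keys.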

2. Experiment Tracking

Are training experiments tracked in a versioned system — parameters, metrics, data versions, and artifacts — such that any experiment can be reproduced and compared to others? Ad-hoc experiment tracking in spreadsheets or Jupyter notebook filenames is a signal of research process, not engineering process.

Red flag: No experiment tracking system, or tracking that captures metrics without capturing the code and data versions required for reproducibility.

3. Reproducible Training Pipelines

Can you reproduce a training run from six months ago? Reproducibility requires: versioned code, versioned training data, versioned dependencies, and documented hyperparameters. The absence of any one of these breaks the reproducibility chain.

Red flag: Training pipelines that depend on mutable data sources, unversioned environments, or undocumented random seeds.
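Documented seeds are the cheapest link in that chain to fix. A sketch of a seed-pinning helper that would be called at the top of every training entry point (the function name is ours; if you use PyTorch or TensorFlow, their seeds must be pinned too):

```python
import os
import random
import numpy as np

def set_global_seeds(seed: int) -> None:
    """Pin every source of randomness we control, and record the seed
    alongside the run so the value itself is never undocumented."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    # If using PyTorch, additionally: torch.manual_seed(seed) and
    # torch.use_deterministic_algorithms(True)

# Same seed, same draws -- the property a rerun six months later relies on.
set_global_seeds(42)
a = np.random.rand(3)
set_global_seeds(42)
b = np.random.rand(3)
assert (a == b).all()
```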

4. Distributed Training Infrastructure

For models that require multi-GPU or multi-node training — is the distributed training infrastructure in place, tested, and understood by the team? Discovering that your next model requires distributed training when you’re ready to train it is an expensive surprise.

Red flag: No experience with distributed training on a team building models that will require it within 12 months.

Category 2: Data Pipeline (4 Checks)

5. Training-Serving Consistency

Are the features used to train the model computed using the same logic as the features computed at serving time? Training-serving skew is one of the most common causes of production model underperformance, and one of the hardest to diagnose retrospectively.

Red flag: Training features computed in a notebook with serving features reimplemented separately in the application code.
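The structural fix is a single feature function that both the training pipeline and the serving service import — a sketch, with hypothetical feature names:

```python
import math

def compute_features(raw: dict) -> dict:
    """Single source of truth for feature logic. The training pipeline maps
    this over historical rows; the serving service calls it per request.
    Neither side reimplements the transformations."""
    return {
        "log_amount": math.log1p(raw["amount"]),
        "is_weekend": int(raw["day_of_week"] >= 5),  # 5 = Saturday, 6 = Sunday
    }

# Training path: applied to a historical record
train_row = compute_features({"amount": 120.0, "day_of_week": 6})

# Serving path: applied to a live request payload -- same function, same logic
serve_row = compute_features({"amount": 120.0, "day_of_week": 6})
```

If the notebook and the application each have their own copy of this logic, any divergence between them is silent until accuracy drops.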

6. Point-in-Time Correctness

Does the training data pipeline ensure that features used for each training example reflect only information that was available at the time of the example, with no look-ahead bias? This is particularly critical for time-series and event-based ML.

Red flag: Features joined without temporal constraints, or training data constructed from current feature values rather than historical values at the event time.
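One concrete way to enforce the temporal constraint is an as-of join. A sketch using pandas `merge_asof`, with made-up data: each labelled event picks up the latest feature value computed at or before the event time, never a future one:

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1],
    "event_ts": pd.to_datetime(["2026-01-10", "2026-02-10"]),
    "label": [0, 1],
})
features = pd.DataFrame({
    "user_id": [1, 1],
    "feature_ts": pd.to_datetime(["2026-01-05", "2026-02-01"]),
    "spend_30d": [120.0, 340.0],
})

# direction="backward" matches each event to the most recent feature row
# with feature_ts <= event_ts -- the join itself forbids look-ahead.
train = pd.merge_asof(
    events.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    left_on="event_ts",
    right_on="feature_ts",
    by="user_id",
    direction="backward",
)
```

A plain join on `user_id` against current feature values would have given both training examples the latest `spend_30d`, leaking February information into the January example.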

7. Data Quality Monitoring

Is there active monitoring for data quality issues in the training and serving pipelines — null rates, out-of-range values, schema changes, and distribution shifts in input data? Data quality failures upstream of the model are a common and preventable cause of model degradation.

Red flag: No automated data quality checks, with data quality issues discovered when models start performing poorly.
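Even a minimal batch gate catches most of these failures before they reach the model. A sketch (column names and thresholds are illustrative; production setups typically use a framework like Great Expectations):

```python
import pandas as pd

def check_batch(df: pd.DataFrame, max_null_rate: float = 0.01) -> list[str]:
    """Return a list of data-quality violations for one batch; empty means clean."""
    problems = []
    null_rates = df.isna().mean()
    for col, rate in null_rates.items():
        if rate > max_null_rate:
            problems.append(f"{col}: null rate {rate:.1%} exceeds {max_null_rate:.0%}")
    # Example range check -- one per feature in a real pipeline
    if "amount" in df.columns and (df["amount"] < 0).any():
        problems.append("amount: negative values out of expected range")
    return problems

batch = pd.DataFrame({
    "amount": [12.5, None, -3.0],
    "country": ["US", "US", "DE"],
})
issues = check_batch(batch)  # two violations: null rate and negative amount
```

The key design choice is running this upstream of both training and serving, so a schema change or null spike raises an alert instead of silently degrading predictions.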

8. Feature Store Architecture

Is there a feature store (or defined feature computation library) that provides shared feature definitions across training and serving? This is the architectural foundation that prevents training-serving skew and supports consistent feature development across models.

Red flag: No feature store, with each model team implementing their own feature computation independently.

Category 3: Model Serving (4 Checks)

9. Serving Infrastructure Scalability

Can the model serving layer handle 5× current throughput without architectural changes? Serving infrastructure that works at current scale but requires architectural changes to scale is a known liability.

Red flag: FastAPI serving without batching, no horizontal scaling configuration, no load testing at peak throughput estimates.

10. Model Version Management

Can you deploy a new model version alongside the current version, run an A/B test between them, and roll back to the previous version within minutes if the new version underperforms?

Red flag: Model updates require full deployment; no rollback mechanism; no A/B testing capability.
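The shape of the fix is a routing layer where the serving version is a pointer, not a deployment. An illustrative sketch (real systems use the serving platform's traffic-splitting primitives; the class and method names here are ours):

```python
class ModelRouter:
    """Route a fraction of traffic to a candidate version. Rollback is a
    single pointer flip, not a redeploy."""

    def __init__(self, stable: str):
        self.stable = stable
        self.candidate = None
        self.candidate_fraction = 0.0

    def start_ab_test(self, candidate: str, fraction: float) -> None:
        self.candidate = candidate
        self.candidate_fraction = fraction

    def route(self, request_id: int) -> str:
        """Deterministic split so the same request always hits the same arm."""
        if self.candidate and (request_id % 100) < self.candidate_fraction * 100:
            return self.candidate
        return self.stable

    def promote(self) -> None:
        self.stable, self.candidate = self.candidate, None

    def rollback(self) -> None:
        self.candidate, self.candidate_fraction = None, 0.0

router = ModelRouter(stable="v1")
router.start_ab_test(candidate="v2", fraction=0.10)  # 10% of traffic to v2
```

Because both versions are loaded simultaneously, `rollback()` takes effect on the next request — the "within minutes" bar this check sets.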

11. Serving Latency Profile

Have you measured and documented the p50, p95, and p99 serving latency for your production models under realistic load? Latency degrades under concurrent load in ways that single-request benchmarks don’t reveal.

Red flag: Latency only measured for single requests, no load testing data, no latency SLA defined.
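Computing the profile from load-test samples is trivial once the samples exist — the hard part is collecting them under realistic concurrency. A sketch with made-up numbers:

```python
import numpy as np

# Latency samples (ms) collected during a load test at projected peak
# concurrency -- not single-request benchmarks.
samples_ms = np.array([12, 14, 13, 15, 80, 13, 14, 16, 12, 210])

p50, p95, p99 = np.percentile(samples_ms, [50, 95, 99])
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```

Note how the tail diverges from the median: an SLA written against p50 alone would hide the slow requests your heaviest users actually experience.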

12. Inference Cost Tracking

Do you track per-prediction inference cost, and does your pricing or business model account for serving costs at scale? Unit economics for ML products depend on serving cost management.

Red flag: No visibility into per-prediction cost, serving cost not factored into pricing decisions.
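The baseline calculation is simple; what matters is that someone owns the number. A sketch with hypothetical figures for an always-on fleet:

```python
def cost_per_prediction(instance_hourly_usd: float,
                        instances: int,
                        predictions_per_hour: float) -> float:
    """Blended serving cost per prediction for an always-on fleet.
    Ignores autoscaling and idle time -- a deliberate simplification."""
    return (instance_hourly_usd * instances) / predictions_per_hour

# Hypothetical: 2 GPU instances at $1.20/hr serving 90,000 predictions/hr.
unit_cost = cost_per_prediction(1.20, 2, 90_000)
```

Multiplying `unit_cost` by projected volume per customer is the sanity check that should happen before pricing is set, not after the cloud bill arrives.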

Category 4: Monitoring (4 Checks)

13. Model Quality Monitoring

Is there active monitoring for model performance degradation — accuracy metrics where ground truth is available, prediction distribution tracking where it isn’t, and alerting when quality metrics fall below thresholds?

Red flag: Infrastructure monitoring only; no model quality monitoring; degradation discovered by users.

14. Data Drift Detection

Are you monitoring for distribution shift in the features your model receives at serving time, with statistical tests that detect significant drift before it degrades model performance?

Red flag: No feature distribution monitoring; drift discovered retrospectively when accuracy analysis is triggered by user reports.
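A two-sample Kolmogorov–Smirnov test is a common starting point for per-feature drift checks. A sketch on synthetic data (the 0.4 shift and the alert threshold are illustrative; thresholds need tuning per feature and sample size, since large samples flag tiny shifts):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)    # training reference
serving_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)  # shifted in production

# KS test compares the two empirical distributions directly --
# no ground truth labels required, so it runs before accuracy can drop.
stat, p_value = ks_2samp(train_feature, serving_feature)
drifted = p_value < 0.01  # alert threshold
```

Because this runs on features rather than labels, it fires on the shift itself, not on the user reports that would otherwise trigger a retrospective accuracy analysis.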

15. Ground Truth Collection

Is there a pipeline for collecting ground truth labels for production predictions — either directly (explicit feedback, outcomes) or indirectly (proxy metrics, behavioural signals) — to enable ongoing accuracy evaluation?

Red flag: No ground truth collection; model accuracy in production unmeasured after training.

16. Alerting and Escalation

Are monitoring alerts configured with appropriate thresholds, escalation paths, and runbooks for the ML-specific failure modes — not just the infrastructure failure modes?

Red flag: Alerts only for infrastructure failures; no defined response procedure for model quality degradation.

Category 5: Deployment Process (4 Checks)

17. Staging Environment

Is there a staging environment that reflects production data distribution and traffic patterns — not just a smaller version of production infrastructure — where new models are evaluated before production deployment?

Red flag: Staging environment with synthetic or outdated data that doesn’t reflect current production distribution.

18. Rollout Strategy

Are new model versions deployed via canary or gradual rollout — with monitoring active during rollout and defined criteria for proceeding to full deployment or rolling back?

Red flag: Big-bang deployments; no canary strategy; rollback is a manual process without defined criteria.
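"Defined criteria" can be as simple as an explicit, automated gate that the rollout tooling evaluates against canary metrics. A sketch (the metric and tolerance are placeholders for whatever quality bar your model carries):

```python
def canary_decision(baseline_error: float,
                    canary_error: float,
                    tolerance: float = 0.02) -> str:
    """Automatic rollout criterion: proceed only if the canary's error rate
    is within `tolerance` (absolute) of the baseline's. Anything else is a
    rollback -- no on-the-spot judgment calls during an incident."""
    if canary_error <= baseline_error + tolerance:
        return "proceed"
    return "rollback"

# Canary slightly worse but within tolerance: proceed to full rollout.
decision = canary_decision(baseline_error=0.10, canary_error=0.11)
```

The value of writing the criterion down as code is that the rollback decision is made calmly in advance, not improvised while a degraded model serves traffic.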

19. Deployment Documentation

Is there a documented runbook for model deployment — including pre-deployment checks, deployment steps, post-deployment validation, and rollback procedure — that any qualified team member can follow?

Red flag: Deployment knowledge held by one person; undocumented process; no post-deployment validation checklist.

20. Incident Response for ML Failures

Is there a defined incident response process for ML-specific failures — model degradation, prediction drift, serving infrastructure failures — with roles, escalation criteria, and documented response procedures?

Red flag: ML incidents handled ad-hoc; no defined ML failure classification; ML failures handled as infrastructure incidents without ML-specific diagnosis.


This checklist represents the minimum bar for production ML engineering maturity at a company raising Series B or selling to enterprise customers. Not every item needs to be addressed before your first production deployment — but each item that remains unaddressed is a known risk that will eventually surface.

Request an ML architecture review and we’ll work through this checklist with your team, prioritise the gaps, and provide a remediation roadmap.

Build ML that scales.

Book a free 30-minute ML architecture scope call with our experts. We review your stack and tell you exactly what to fix before it breaks at scale.

Talk to an Expert