AI Architecture Strategy for Multi-Agent System to Automate 40% of Customer Support Queries
Executive summary
A leading SaaS platform serving the professional services industry provides appointment management, point-of-sale, marketing, and client relationship tools to thousands of small business owners globally.
The problem
The company’s engineering team had built a multi-agent AI system for business-to-customer interactions but faced critical technical roadblocks: unreliable intent detection in production, inconsistent service recommendations from vector stores, and authentication flow issues that broke conversational experiences.
The solution
Zartis conducted a strategic AI architecture workshop, analysing the client’s existing system and delivering targeted recommendations on information retrieval, agent design, testing infrastructure, and production deployment strategy.
About the client
Industry: SaaS – Professional Services Management
Headquarters: Dublin, Ireland
Global Reach: Serving businesses across North America, UK, Europe, and Australia
The client serves thousands of independent professional service businesses with a comprehensive platform managing the full customer lifecycle—from appointment booking and point-of-sale to marketing automation and client relationships.
As customer expectations evolved toward instant, 24/7 support, the company recognised AI-powered customer interaction as a strategic priority—both to reduce operational burden on business staff and to differentiate their platform in a competitive market.
The problem
Business owners using the platform faced a persistent operational challenge:
Customer enquiries such as booking changes, service questions, and staff availability consumed significant front-desk time during peak hours when staff should focus on in-person clients.
The company’s product vision was clear: build an AI agent that could interact with customers like a human staff member—handling appointment changes, recommending services, matching new customers with the right provider, and maintaining each business’s unique brand voice.
The business case was compelling; however, the engineering team had hit critical technical walls in their initial implementation that threatened to derail the entire initiative.
Why traditional approaches weren't working
The internal team had already invested significant effort in building a multi-agent AI system using LangGraph and GPT-4, but encountered roadblocks that generic AI best practices couldn’t solve:
Intent detection failed in production despite perfect test results
While their test datasets achieved 100% intent recognition accuracy, real-world customer conversations with unexpected phrasing created inefficient loops between agents. Users would get bounced between agents as the system struggled to understand complex requests like “I need to reschedule my appointment and add an additional service if my usual provider is available.”
Vector store recommendations were inconsistent and unreliable
Initial attempts to use semantic similarity (vector embeddings) for matching customers with services and providers produced illogical suggestions—especially problematic for new customers with no history. Business owners demand predictability, and “black box” AI recommendations erode trust.
Fine-tuning attempts failed spectacularly
Prior experiments fine-tuning LLMs on conversation data led to overfitting and brittle models that couldn’t generalise. The team wasted weeks on this approach before abandoning it.
Authentication flows broke conversational experience
Per-agent authentication created jarring moments where customers would be asked for phone numbers late in a booking flow, destroying the natural conversation feel.
Growing complexity without clear architectural patterns
With ~14 agents and overlapping responsibilities, the system was becoming harder to debug and maintain. The team lacked confidence in how to structure agents for maximum reliability.
Why they chose Zartis
The client team knew what they wanted to build but needed specialised AI/LLM architecture expertise to navigate these production challenges and validate their technical approach.
We were brought on to the project for our experience in delivering:
- Operational efficiency: Free business staff from repetitive phone and chat enquiries to focus on higher-value in-person service
- Revenue protection: Handle enquiries during hours when front-desk staff are unavailable, preventing lost bookings
- Competitive differentiation: Deliver AI-powered customer experience capabilities competitors couldn't match
- Scalability: Enable small businesses to deliver concierge-level service without hiring additional staff
The Zartis approach
Rather than deliver a generic AI strategy deck, Zartis conducted a hands-on architecture workshop directly with the client’s engineering team—treating it as a collaborative problem-solving session with their actual system and real production challenges.
Deep-dive discovery (90 minutes)
The workshop began with an intensive discovery session where Zartis reviewed the client’s existing LangGraph architecture and analysed their current agent design and orchestration flow. The team examined actual production failure cases and user conversation logs to understand where the system was breaking down in real-world scenarios. This deep dive also assessed the client’s testing methodology and observability infrastructure to identify gaps in their ability to monitor and improve the system.
Targeted recommendations (90 minutes)
With a clear understanding of the challenges, Zartis whiteboarded alternative architectural approaches for key problem areas, discussing the trade-offs between complexity and determinism. The session explored specific tools and techniques applicable to the client’s stack, validating their technical decisions whilst challenging assumptions that were holding the team back. This collaborative approach ensured recommendations were immediately actionable within their existing infrastructure.
Action planning (60 minutes)
The workshop concluded with concrete next steps and deliverables, establishing a clear collaboration model for follow-up analysis. The teams discussed production deployment strategy and risk mitigation, ensuring the client had a roadmap they could execute on with confidence.
What we delivered
Information retrieval architecture redesign
Problem: Vector stores producing inconsistent service/staff recommendations
Recommendation: Replace semantic similarity with graph-based retrieval
- Build knowledge graph of services, staff, skills, and relationships
- Use graph traversal to reason about "best match" based on explicit relationships vs. embedding proximity
- Enables explainable, deterministic recommendations business owners can trust
Why this matters: The client needed reliability over sophistication. Graph-based approach provides predictable results whilst maintaining the flexibility to handle complex service catalogues.
Granular agent design framework
Problem: Intent detection unreliable in production; agents with overlapping responsibilities
Recommendation: Break agents into smaller, more focused units with clear boundaries
- Split complex agents (like "booking") into subgraphs with distinct responsibilities
- Use structured workflows with routers and defined control flow
- Implement "agent handoff" pattern for smooth transitions
- Design agents for single, testable purposes
Why this matters: Smaller agents = clearer failure points. When something breaks, the team can debug specific components rather than wrestling with monolithic LLM behaviour.
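The router-plus-focused-agents pattern reduces to a small dispatch structure. The sketch below uses a keyword router as a hypothetical stand-in for LLM-backed intent classification; agent names are illustrative.

```python
# Stripped-down illustration of the router + focused-agent pattern.
# In production the router would be an LLM intent classifier; here a
# keyword lookup stands in for it.

def reschedule_agent(msg: str) -> str:
    return "Rescheduling handled."

def service_info_agent(msg: str) -> str:
    return "Service question answered."

def fallback_agent(msg: str) -> str:
    return "Handing off to a human."

ROUTES = {
    "reschedule": reschedule_agent,
    "service": service_info_agent,
}

def router(msg: str) -> str:
    """Route to exactly one focused agent; unknown intents fall
    through to a single explicit fallback instead of bouncing
    between agents."""
    for keyword, agent in ROUTES.items():
        if keyword in msg.lower():
            return agent(msg)
    return fallback_agent(msg)

print(router("I need to reschedule my appointment"))  # Rescheduling handled.
```

Each agent has a single testable purpose, so a failure points directly at one component rather than at a monolithic prompt.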
Prompt engineering over fine-tuning
Problem: Fine-tuning attempts led to overfitting and brittle models
Recommendation: Invest in semantically rich, detailed prompt engineering
- Use existing chat conversation data to inform conversational style
- Create comprehensive system prompts with explicit instructions
- Template common elements (date/time formatting, business context) for consistency
- Treat prompts as versioned code artefacts, not playground experiments
Why this matters: Prompt engineering is 10x faster to iterate, doesn't require model retraining when business logic changes, and produces more predictable results.
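Treating prompts as versioned, templated code might look like the sketch below. The template text, version name, and fields are hypothetical examples of the recommended practice, not the client's actual prompts.

```python
from datetime import date
from string import Template

# Hypothetical versioned prompt template: business context and date
# formatting are injected at render time rather than baked into a
# fine-tuned model.
SYSTEM_PROMPT_V3 = Template(
    "You are the virtual receptionist for $business_name.\n"
    "Tone: $brand_voice.\n"
    "Today is $today. Always confirm dates in DD Month YYYY format.\n"
    "If you are unsure of an answer, offer to connect the customer "
    "with a staff member instead of guessing."
)

def render_prompt(business_name: str, brand_voice: str) -> str:
    """Render the system prompt with per-business context."""
    return SYSTEM_PROMPT_V3.substitute(
        business_name=business_name,
        brand_voice=brand_voice,
        today=date.today().strftime("%d %B %Y"),
    )

print(render_prompt("Glow Salon", "warm and professional"))
```

When a service description or brand voice changes, only the template inputs change; no retraining is required.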
Testing & observability infrastructure
Problem: Limited visibility into production failures; unclear metrics for success
Recommendation: Build automated evaluation pipelines with:
- Unit testing for individual agents in LangSmith
- End-to-end testing of multi-agent workflows (with mocked network layers)
- Perplexity analysis at each workflow step to identify where models are uncertain
- Provider-agnostic architecture to test different LLMs quickly
Key insight from workshop: Zartis introduced perplexity as a diagnostic metric—analysing the probability distribution of generated tokens reveals where prompts are confusing or context is insufficient.
Why this matters: "Works in testing but fails in production" is solvable with the right observability. Measuring model confidence (perplexity) at each step creates early warning signals.
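The perplexity metric itself is straightforward to compute from per-token log-probabilities, which major LLM APIs can return on request (e.g. the `logprobs` option on OpenAI chat completions). A minimal sketch:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean log-probability) over generated tokens.

    Values near 1 mean the model was confident; high values mean it
    was effectively guessing, which flags a confusing prompt or
    insufficient context at that workflow step.
    """
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Confident generation: logprobs near 0 -> perplexity near 1.
print(perplexity([-0.05, -0.02, -0.1]))
# Uncertain generation: very negative logprobs -> high perplexity.
print(perplexity([-2.5, -3.1, -2.8]))
```

Logging this value at each step of a multi-agent workflow turns "it failed in production" into "step 3's perplexity spiked on this class of input".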
Model selection & optimisation
Problem: Using GPT-4 Turbo for everything; occasional hallucination on ID extraction
Recommendation: Task-specific model selection
- Continue GPT-4 Turbo for complex reasoning tasks
- Explore specialised extraction models for ID parsing
- Consider OpenAI Agents SDK for higher-level agent abstraction
- Test Anthropic Claude for scenarios requiring large context windows
Why this matters: Right-sizing models to tasks balances performance, cost, and reliability. Using a specialised extraction model for structured data avoids GPT-4's tendency to "hallucinate" valid-looking IDs.
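Task-specific selection can be as simple as a routing table. The model identifiers below are placeholders to illustrate the pattern; they should be replaced with whatever the provider currently offers.

```python
# Hypothetical task -> model routing table. Model names are
# illustrative placeholders, not recommendations of specific SKUs.
MODEL_FOR_TASK = {
    "complex_reasoning": "gpt-4-turbo",
    "id_extraction": "specialised-extraction-model",
    "long_context_summary": "large-context-model",
}

DEFAULT_MODEL = "gpt-4-turbo"

def pick_model(task: str) -> str:
    """Right-size the model to the task; unknown tasks fall back to
    the general reasoning model."""
    return MODEL_FOR_TASK.get(task, DEFAULT_MODEL)

print(pick_model("id_extraction"))
```

Centralising the mapping also makes the architecture provider-agnostic: swapping a model for a given task is a one-line change rather than a refactor.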
Production deployment strategy
Problem: Unclear path from workshop to production; security concerns
Recommendation: Gradual rollout with risk mitigation
- Phase 1: Internal testing with business owners and staff (feedback loop)
- Phase 2: Feature-flagged release to controlled subset of businesses
- Phase 3: Monitored rollout with A/B testing and performance metrics
- Security: Red teaming for prompt injection, moderator nodes for content filtering
- Authentication: Intent-based authentication (trigger only when necessary, not upfront)
Why this matters: AI in production requires different risk management than traditional software. Phased rollout with kill switches ensures issues don't impact entire customer base.
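Intent-based authentication in particular reduces to a small guard: only intents that touch personal data trigger a verification prompt, and only if the session is not already verified. The intent names below are hypothetical.

```python
# Sketch of intent-based authentication: casual questions stay
# frictionless, while booking changes trigger verification exactly
# once per session. Intent names are illustrative.
AUTH_REQUIRED = {"reschedule_booking", "cancel_booking", "view_history"}

def needs_auth(intent: str, session_authenticated: bool) -> bool:
    """Ask for verification only when the detected intent requires it
    and the session is not already verified."""
    return intent in AUTH_REQUIRED and not session_authenticated

print(needs_auth("service_question", False))    # False: no auth friction
print(needs_auth("reschedule_booking", False))  # True: verify before changes
```

This keeps the conversation natural: the customer is asked for a phone number at the moment it becomes necessary, not mid-flow or upfront.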
What made this work
Key technical decisions
Graph retrieval over vector stores
Whilst vector embeddings are trendy, they’re probabilistic. For matching new customers with the “perfect service and provider,” the client needed explainable, deterministic logic business owners could trust. Graph traversal provides that: “We recommended this provider because they’re certified in this service, hold a 4.9-star rating, and have availability Tuesday afternoon.”
Prompt engineering over fine-tuning
Service definitions are diverse and constantly evolving (new services added, descriptions updated). Fine-tuned models become stale and require expensive retraining. Prompt engineering with retrieval stays fresh and adapts to changes immediately.
Granular agents over monolithic LLMs
The company’s prototype used a single LLM for all tasks—it was “confusing.” Breaking into specialised agents (intent detection → service reasoning → booking execution → summary) creates clear boundaries and testable units. Each agent does one thing well.
Perplexity as a diagnostic tool
Most teams treat LLMs as black boxes. Zartis’s recommendation to analyse perplexity (model confidence) at each workflow step gives the client a leading indicator of quality. High perplexity = “model is guessing” = prompt or context needs improvement.
The results
Immediate workshop deliverables
The client left the workshop with validated technical direction and confidence in their architecture approach. Key questions that had stalled internal discussions (“Should we use vector stores? How granular should agents be? Is fine-tuning worth it?”) were resolved with specific, actionable answers.
Architectural roadmap
- Graph-based retrieval design for service/staff matching
- Granular agent structure with clear responsibilities and handoff patterns
- Testing infrastructure recommendations (automated pipelines, perplexity metrics)
- Model selection guidance (task-specific LLMs, specialised extraction models)
- Production deployment strategy with security considerations
Knowledge transfer
The engineering team gained production-tested patterns and techniques from Zartis experts who had built similar systems:
- How to design prompts for determinism vs. creativity
- When to split agents vs. keep them unified
- How to debug LLM failures using perplexity analysis
- Security considerations (prompt injection testing, content moderation)
Tooling recommendations
- LangSmith for prompt versioning and A/B testing (already in use; the workshop added deeper tactics)
- Specialised extraction models for structured data
- OpenAI Agents SDK for higher-level agent orchestration
- Perplexity analysis scripts for diagnostic evaluation
What this enabled next
From workshop to production
Following the workshop, the engineering team moved forward with implementing Zartis’s recommendations.
The client now has confidence and clarity in their AI roadmap. Internal debates about architectural approaches were resolved with expert validation. The team can execute with speed, knowing they’re building on proven patterns rather than guessing.
The workshop didn’t just solve immediate technical problems; it equipped the team with frameworks and mental models for making future AI architecture decisions independently.
Facing AI architecture challenges?
If your team is building with LLMs but struggling with reliability, testing, or production deployment strategy, Zartis brings hands-on expertise from real implementations. We don’t deliver generic AI roadmaps—we solve actual technical problems alongside your engineers.