AI Architecture Strategy for Multi-Agent System to Automate 40% of Customer Support Queries
Executive summary
A leading SaaS platform serving the professional services industry provides appointment management, point-of-sale, marketing, and client relationship tools to thousands of small business owners globally.
The problem
The company’s engineering team had built a multi-agent AI system for business-to-customer interactions but faced critical technical roadblocks: unreliable intent detection in production, inconsistent service recommendations from vector stores, and authentication flow issues that broke conversational experiences.
The solution
Zartis conducted a strategic AI architecture workshop, analysing the client’s existing system and delivering targeted recommendations on information retrieval, agent design, testing infrastructure, and production deployment strategy.
About the client
Industry: SaaS – Professional Services Management
Headquarters: Dublin, Ireland
Global Reach: Serving businesses across North America, UK, Europe, and Australia
The client serves thousands of independent professional service businesses with a comprehensive platform managing the full customer lifecycle—from appointment booking and point-of-sale to marketing automation and client relationships.
As customer expectations evolved toward instant, 24/7 support, the company recognised AI-powered customer interaction as a strategic priority—both to reduce operational burden on business staff and to differentiate their platform in a competitive market.
The problem
Business owners using the platform faced a persistent operational challenge:
Customer enquiries such as booking changes, service questions, and staff availability consumed significant front-desk time during peak hours when staff should focus on in-person clients.
The company’s product vision was clear: build an AI agent that could interact with customers like a human staff member—handling appointment changes, recommending services, matching new customers with the right provider, and maintaining each business’s unique brand voice.
The business case was compelling; however, the engineering team had hit critical technical walls in their initial implementation that threatened to derail the entire initiative.
Why traditional approaches weren't working
The internal team had already invested significant effort in building a multi-agent AI system using LangGraph and GPT-4, but encountered roadblocks that generic AI best practices couldn’t solve:
Intent detection failed in production despite perfect test results
While their test datasets achieved 100% intent recognition accuracy, real-world customer conversations with unexpected phrasing created inefficient loops between agents. Users would get bounced between agents as the system struggled to understand complex requests like “I need to reschedule my appointment and add an additional service if my usual provider is available.”
Vector store recommendations were inconsistent and unreliable
Initial attempts to use semantic similarity (vector embeddings) for matching customers with services and providers produced illogical suggestions—especially problematic for new customers with no history. Business owners demand predictability, and “black box” AI recommendations erode trust.
Fine-tuning attempts failed spectacularly
Prior experiments fine-tuning LLMs on conversation data led to overfitting and brittle models that couldn’t generalise. The team wasted weeks on this approach before abandoning it.
Authentication flows broke conversational experience
Per-agent authentication created jarring moments where customers would be asked for phone numbers late in a booking flow, destroying the natural conversation feel.
Growing complexity without clear architectural patterns
With ~14 agents and overlapping responsibilities, the system was becoming harder to debug and maintain. The team lacked confidence in how to structure agents for maximum reliability.
Why they chose Zartis
The client team knew what they wanted to build but needed specialised AI/LLM architecture expertise to navigate these production challenges and validate their technical approach.
We were brought on to the project for our experience in delivering:
- Operational efficiency: Free business staff from repetitive phone and chat enquiries to focus on higher-value in-person service
- Revenue protection: Handle enquiries during hours when front-desk staff are unavailable, preventing lost bookings
- Competitive differentiation: Deliver AI-powered customer experience capabilities competitors couldn't match
- Scalability: Enable small businesses to deliver concierge-level service without hiring additional staff
The Zartis approach
Rather than deliver a generic AI strategy deck, Zartis conducted a hands-on architecture workshop directly with the client’s engineering team—treating it as a collaborative problem-solving session with their actual system and real production challenges.
Deep-dive discovery (90 minutes)
The workshop began with an intensive discovery session where Zartis reviewed the client’s existing LangGraph architecture and analysed their current agent design and orchestration flow. The team examined actual production failure cases and user conversation logs to understand where the system was breaking down in real-world scenarios. This deep dive also assessed the client’s testing methodology and observability infrastructure to identify gaps in their ability to monitor and improve the system.
Targeted recommendations (90 minutes)
With a clear understanding of the challenges, Zartis whiteboarded alternative architectural approaches for key problem areas, discussing the trade-offs between complexity and determinism. The session explored specific tools and techniques applicable to the client’s stack, validating their technical decisions whilst challenging assumptions that were holding the team back. This collaborative approach ensured recommendations were immediately actionable within their existing infrastructure.
Action planning (60 minutes)
The workshop concluded with concrete next steps and deliverables, establishing a clear collaboration model for follow-up analysis. The teams discussed production deployment strategy and risk mitigation, ensuring the client had a roadmap they could execute on with confidence.
What we delivered
Information retrieval architecture redesign
Problem: Vector stores producing inconsistent service/staff recommendations
Recommendation: Replace semantic similarity with graph-based retrieval
- Build knowledge graph of services, staff, skills, and relationships
- Use graph traversal to reason about "best match" based on explicit relationships vs. embedding proximity
- Enables explainable, deterministic recommendations business owners can trust
Why this matters: The client needed reliability over sophistication. Graph-based approach provides predictable results whilst maintaining the flexibility to handle complex service catalogues.
Granular agent design framework
Problem: Intent detection unreliable in production; agents with overlapping responsibilities
Recommendation: Break agents into smaller, more focused units with clear boundaries
- Split complex agents (like "booking") into subgraphs with distinct responsibilities
- Use structured workflows with routers and defined control flow
- Implement "agent handoff" pattern for smooth transitions
- Design agents for single, testable purposes
Why this matters: Smaller agents = clearer failure points. When something breaks, the team can debug specific components rather than wrestling with monolithic LLM behaviour.
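The router-plus-focused-agents pattern reduces to a small dispatch structure. The sketch below uses a keyword router as a hypothetical stand-in for LLM-backed intent classification; agent names are illustrative.

```python
# Stripped-down illustration of the router + focused-agent pattern.
# In production the router would be an LLM intent classifier; here a
# keyword lookup stands in for it.

def reschedule_agent(msg: str) -> str:
    return "Rescheduling handled."

def service_info_agent(msg: str) -> str:
    return "Service question answered."

def fallback_agent(msg: str) -> str:
    return "Handing off to a human."

ROUTES = {
    "reschedule": reschedule_agent,
    "service": service_info_agent,
}

def router(msg: str) -> str:
    """Route to exactly one focused agent; unknown intents fall
    through to a single explicit fallback instead of bouncing
    between agents."""
    for keyword, agent in ROUTES.items():
        if keyword in msg.lower():
            return agent(msg)
    return fallback_agent(msg)

print(router("I need to reschedule my appointment"))  # Rescheduling handled.
```

Each agent has a single testable purpose, so a failure points directly at one component rather than at a monolithic prompt.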
Prompt engineering over fine-tuning
Problem: Fine-tuning attempts led to overfitting and brittle models
Recommendation: Invest in semantically rich, detailed prompt engineering
- Use existing chat conversation data to inform conversational style
- Create comprehensive system prompts with explicit instructions
- Template common elements (date/time formatting, business context) for consistency
- Treat prompts as versioned code artefacts, not playground experiments
Why this matters: Prompt engineering is 10x faster to iterate, doesn't require model retraining when business logic changes, and produces more predictable results.
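Treating prompts as versioned, templated code might look like the sketch below. The template text, version name, and fields are hypothetical examples of the recommended practice, not the client's actual prompts.

```python
from datetime import date
from string import Template

# Hypothetical versioned prompt template: business context and date
# formatting are injected at render time rather than baked into a
# fine-tuned model.
SYSTEM_PROMPT_V3 = Template(
    "You are the virtual receptionist for $business_name.\n"
    "Tone: $brand_voice.\n"
    "Today is $today. Always confirm dates in DD Month YYYY format.\n"
    "If you are unsure of an answer, offer to connect the customer "
    "with a staff member instead of guessing."
)

def render_prompt(business_name: str, brand_voice: str) -> str:
    """Render the system prompt with per-business context."""
    return SYSTEM_PROMPT_V3.substitute(
        business_name=business_name,
        brand_voice=brand_voice,
        today=date.today().strftime("%d %B %Y"),
    )

print(render_prompt("Glow Salon", "warm and professional"))
```

When a service description or brand voice changes, only the template inputs change; no retraining is required.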
Testing & observability infrastructure
Problem: Limited visibility into production failures; unclear metrics for success
Recommendation: Build automated evaluation pipelines with:
- Unit testing for individual agents in LangSmith
- End-to-end testing of multi-agent workflows (with mocked network layers)
- Perplexity analysis at each workflow step to identify where models are uncertain
- Provider-agnostic architecture to test different LLMs quickly
Key insight from workshop: Zartis introduced perplexity as a diagnostic metric—analysing the probability distribution of generated tokens reveals where prompts are confusing or context is insufficient.
Why this matters: "Works in testing but fails in production" is solvable with the right observability. Measuring model confidence (perplexity) at each step creates early warning signals.
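The perplexity metric itself is straightforward to compute from per-token log-probabilities, which major LLM APIs can return on request (e.g. the `logprobs` option on OpenAI chat completions). A minimal sketch:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean log-probability) over generated tokens.

    Values near 1 mean the model was confident; high values mean it
    was effectively guessing, which flags a confusing prompt or
    insufficient context at that workflow step.
    """
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Confident generation: logprobs near 0 -> perplexity near 1.
print(perplexity([-0.05, -0.02, -0.1]))
# Uncertain generation: very negative logprobs -> high perplexity.
print(perplexity([-2.5, -3.1, -2.8]))
```

Logging this value at each step of a multi-agent workflow turns "it failed in production" into "step 3's perplexity spiked on this class of input".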
Model selection & optimisation
Problem: Using GPT-4 Turbo for everything; occasional hallucination on ID extraction
Recommendation: Task-specific model selection
- Continue GPT-4 Turbo for complex reasoning tasks
- Explore specialised extraction models for ID parsing
- Consider OpenAI Agents SDK for higher-level agent abstraction
- Test Anthropic Claude for scenarios requiring large context windows
Why this matters: Right-sizing models to tasks balances performance, cost, and reliability. Using a specialised extraction model for structured data avoids GPT-4's tendency to "hallucinate" valid-looking IDs.
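Task-specific selection can be as simple as a routing table. The model identifiers below are placeholders to illustrate the pattern; they should be replaced with whatever the provider currently offers.

```python
# Hypothetical task -> model routing table. Model names are
# illustrative placeholders, not recommendations of specific SKUs.
MODEL_FOR_TASK = {
    "complex_reasoning": "gpt-4-turbo",
    "id_extraction": "specialised-extraction-model",
    "long_context_summary": "large-context-model",
}

DEFAULT_MODEL = "gpt-4-turbo"

def pick_model(task: str) -> str:
    """Right-size the model to the task; unknown tasks fall back to
    the general reasoning model."""
    return MODEL_FOR_TASK.get(task, DEFAULT_MODEL)

print(pick_model("id_extraction"))
```

Centralising the mapping also makes the architecture provider-agnostic: swapping a model for a given task is a one-line change rather than a refactor.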
Production deployment strategy
Problem: Unclear path from workshop to production; security concerns
Recommendation: Gradual rollout with risk mitigation
- Phase 1: Internal testing with business owners and staff (feedback loop)
- Phase 2: Feature-flagged release to controlled subset of businesses
- Phase 3: Monitored rollout with A/B testing and performance metrics
- Security: Red teaming for prompt injection, moderator nodes for content filtering
- Authentication: Intent-based authentication (trigger only when necessary, not upfront)
Why this matters: AI in production requires different risk management than traditional software. Phased rollout with kill switches ensures issues don't impact entire customer base.
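Intent-based authentication in particular reduces to a small guard: only intents that touch personal data trigger a verification prompt, and only if the session is not already verified. The intent names below are hypothetical.

```python
# Sketch of intent-based authentication: casual questions stay
# frictionless, while booking changes trigger verification exactly
# once per session. Intent names are illustrative.
AUTH_REQUIRED = {"reschedule_booking", "cancel_booking", "view_history"}

def needs_auth(intent: str, session_authenticated: bool) -> bool:
    """Ask for verification only when the detected intent requires it
    and the session is not already verified."""
    return intent in AUTH_REQUIRED and not session_authenticated

print(needs_auth("service_question", False))    # False: no auth friction
print(needs_auth("reschedule_booking", False))  # True: verify before changes
```

This keeps the conversation natural: the customer is asked for a phone number at the moment it becomes necessary, not mid-flow or upfront.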
What made this work
Key technical decisions
Graph retrieval over vector stores
Whilst vector embeddings are trendy, they’re probabilistic. For matching new customers with the “perfect service and provider,” the client needed explainable, deterministic logic business owners could trust. Graph traversal provides that: “We recommended this provider because they’re certified in this service, hold a 4.9-star rating, and have availability Tuesday afternoon.”
Prompt engineering over fine-tuning
Service definitions are diverse and constantly evolving (new services added, descriptions updated). Fine-tuned models become stale and require expensive retraining. Prompt engineering with retrieval stays fresh and adapts to changes immediately.
Granular agents over monolithic LLMs
The company’s prototype used a single LLM for all tasks—it was “confusing.” Breaking into specialised agents (intent detection → service reasoning → booking execution → summary) creates clear boundaries and testable units. Each agent does one thing well.
Perplexity as a diagnostic tool
Most teams treat LLMs as black boxes. Zartis’s recommendation to analyse perplexity (model confidence) at each workflow step gives the client a leading indicator of quality. High perplexity = “model is guessing” = prompt or context needs improvement.
The results
Immediate workshop deliverables
The client left the workshop with validated technical direction and confidence in their architecture approach. Key questions that had stalled internal discussions (“Should we use vector stores? How granular should agents be? Is fine-tuning worth it?”) were resolved with specific, actionable answers.
Architectural roadmap
- Graph-based retrieval design for service/staff matching
- Granular agent structure with clear responsibilities and handoff patterns
- Testing infrastructure recommendations (automated pipelines, perplexity metrics)
- Model selection guidance (task-specific LLMs, specialised extraction models)
- Production deployment strategy with security considerations
Knowledge transfer
The engineering team gained production-tested patterns and techniques from Zartis experts who had built similar systems:
- How to design prompts for determinism vs. creativity
- When to split agents vs. keep them unified
- How to debug LLM failures using perplexity analysis
- Security considerations (prompt injection testing, content moderation)
Tooling recommendations
- LangSmith for prompt versioning and A/B testing (already in use; the workshop added deeper tactics)
- Specialised extraction models for structured data
- OpenAI Agents SDK for higher-level agent orchestration
- Perplexity analysis scripts for diagnostic evaluation
What this enabled next
From workshop to production
Following the workshop, the engineering team moved forward with implementing Zartis’s recommendations.
The client now has confidence and clarity in their AI roadmap. Internal debates about architectural approaches were resolved with expert validation. The team can execute with speed, knowing they’re building on proven patterns rather than guessing.
The workshop didn’t just solve immediate technical problems; it equipped the team with frameworks and mental models for making future AI architecture decisions independently.
Facing AI architecture challenges?
If your team is building with LLMs but struggling with reliability, testing, or production deployment strategy, Zartis brings hands-on expertise from real implementations. We don’t deliver generic AI roadmaps—we solve actual technical problems alongside your engineers.