LLMs are being widely deployed in production, with real people chatting back and forth with semi-supervised agents. I’ve been working as a solo dev on one such project, and these are some of my observations from seeing LLM communications in the wild. Many of these experiences are colored by my particular circumstances and resource constraints, but I expect some of them are general problems. Here’s what LLMs are great at, where they fail in production, and a few patterns I’ve found that help.
Chat Is the Hammer
What are the strengths of LLMs?
- Chat: The primary tuning of public models is for chat! The LLMs are great yappers and they will go on and on if you let them.
- Parallelism: The limit on the number of parallel communications with your AI is between you, your bank account and your infrastructure provider.
- Patience: Interacting with people is emotionally draining, especially when you’re set up for failure. In some jobs you are set up to be the punching bag. You do not control the funds, the timeline or the weather! You will have to ask someone and wait. Whether LLMs can truly be said to possess the quality of patience is somewhat abstract, but they are usually upbeat and responsive to every Twilio webhook that comes their way, which is enough!
- Consistency: People love to do things slightly differently. LLMs can reread your institutional best practices each time they reply.
- Asynchronous: People can message you whenever they’re free. Removing the scheduling constraints of even one party in a two-person conversation goes a long way.
Ok, but more importantly what are their weaknesses?
- Lying/Absolute correctness: Because they are trained to be unreasonably helpful, LLMs will often claim to be doing things they are not capable of. This is usually just a sign you should give them those abilities; until you give them reasonable tooling, they will keep lying. They will also lie about what they know. There are basic facts anyone in their role would have access to, but because of the tooling or context you’ve passed in, they have become deceitful. Again, I interpret this as a plea for the correct context! There is a lot to know about anything. If you’re paid full time to do something, odds are you know a lot more about it than an LLM that is missing some subtle context.
- Pushing back: This is not an absolute limitation of LLMs; obviously you can set up a prompt differently or do some fine-tuning, but… at first pass (off-the-shelf APIs) you are likely getting a sycophantic and overly trusting clown.
- Finesse: In tension with “Pushing back,” LLMs can often come off as rude when you have given them (1) simple crutch phrases for situations they cannot or should not handle, and (2) imperative directions it is their sole purpose to fulfill. Especially with an ultra-realistic voice on the phone, people may perceive the AI’s repetitiveness, persistence and broad wrong-headedness as rude. You would not turn over the essential communications of your business to a junior employee. Some conversations require finesse!
Maybe there are nails everywhere?
Businesses of all sizes run on texts, calls and emails… and whatever the dreaded ERPs are in your vertical. Luckily we have friendly(ish) APIs for each of the above. From what I’ve seen, voice is the most fickle: the telephony stack alone can be pretty nasty. Before you know it, you’re digging through PCAP logs in Wireshark to find missing headers on a SIP call… you may ask yourself “How do I work this?” and you may ask yourself, “Where is that large automobile?” I digress.
At Mason we’ve been building an agent for property managers. A property manager is fundamentally a middleman between tenants, owners and vendors. Property managers have a lot of people to speak to, even about simple issues. So Mason the suave and gentle AI steps in to get your toilet fixed faster!
Simple but effective patterns
Unit of Work
In order to have a coherent conversation you need to relate each communication to a subset of all possible interactions. Context poisoning is real — ask me how I accidentally created a dev environment where Mason thinks I have 15 clogged toilets and calls me Big Papa. In our agent this primitive is the work order which is passed around and enriched as a log, a notification and a financial transaction before it’s closed. In order to shrink the possible world of chatter the processing of a communication usually begins with an explicit router to link it to any one or many open pieces of work.
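A minimal sketch of this routing primitive. The names here (`WorkOrder`, `route_message`) are hypothetical, and the router shown is a deterministic first pass on the sender’s phone number; in practice an LLM-backed router would disambiguate when one tenant has several open orders.

```python
from dataclasses import dataclass, field


@dataclass
class WorkOrder:
    """The unit of work: enriched as a log as messages are linked to it."""
    id: int
    tenant_phone: str
    summary: str
    log: list = field(default_factory=list)
    status: str = "open"


def route_message(sender: str, body: str, open_orders: list) -> list:
    """Link an inbound communication to the open work orders it may concern.

    Keeping the candidate set small is what shrinks the possible world
    of chatter before the agent ever generates a reply.
    """
    candidates = [
        wo for wo in open_orders
        if wo.status == "open" and wo.tenant_phone == sender
    ]
    for wo in candidates:
        wo.log.append((sender, body))  # enrich the unit of work as a log
    return candidates
```

The point is that every message lands on an explicit unit of work before generation, rather than floating in a global conversation history.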
Let the Context Flow
I’ve been learning to let the context flow, especially during tool calling. The agent needs to see all of the conversations across parties in order to make the correct decisions. The tradeoff is that decisions are made with full context, while outgoing communications are guarded against leaking sensitive information between parties.
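One way to sketch that split, with hypothetical helper names (`decision_context`, `outbound_context`) and a toy message schema:

```python
# Toy cross-party thread: one maintenance issue, three parties.
MESSAGES = [
    {"party": "tenant", "text": "My water heater is leaking", "private": False},
    {"party": "owner", "text": "Approve repairs up to $400", "private": True},
    {"party": "vendor", "text": "I can come Tuesday", "private": False},
]


def decision_context(messages: list) -> list:
    """The agent sees everything when deciding what to do next."""
    return messages


def outbound_context(messages: list, recipient: str) -> list:
    """But a message drafted for one party never quotes another party's
    private details (e.g. the owner's budget cap)."""
    return [m for m in messages if m["party"] == recipient or not m["private"]]
```

Full context for the decision; a filtered view for each outgoing channel.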
Lazy Follow-ups
Often you need to follow up with a party who is not responding. As stated above, the context behind such a follow-up comes from multiple asynchronous threads, any of which may change before the follow-up is set to execute. Given the risk of hallucination (and of a follow-up silently never firing), instead of re-checking every time the context changes, wait until the scheduled time and then check whether the intended purpose of the follow-up is still relevant.
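The pattern can be sketched as a single check at send time. All names here (`maybe_send_follow_up`, the `follow_up` dict keys, the `is_still_relevant` callback) are hypothetical; in practice the relevance check would itself be an LLM call over the latest context.

```python
import datetime


def maybe_send_follow_up(follow_up: dict, now: datetime.datetime,
                         is_still_relevant, send) -> str:
    """Lazy follow-up: do nothing until the scheduled time, then
    re-check the purpose against the latest context before sending."""
    if now < follow_up["due_at"]:
        return "pending"
    if not is_still_relevant(follow_up):
        return "skipped"  # context changed; don't send a stale nudge
    send(follow_up["draft"])
    return "sent"
```

The relevance check happens exactly once, at the moment it matters, rather than on every context change in between.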
Hardcoded Transitions
At some point in your workflow, you’ll hardcode domain-specific transitions that feel uncomfortably manual for an AI agent. This one is a little hard to stomach, but it may be indicative of where the moat of your agent lies relative to a generic communications bot. Beyond all the effort one goes through to integrate with the industry’s existing ERPs, there is some class of vertical-specific workflow domain knowledge which your agent needs. This is some natural point in a workflow where the agent transitions purposes, gains or loses context and tools. Why is this hard to stomach? Because you’re building off of tools that are supposedly plucked out of I.I.D. heaven, and you worry that there is probably some general pattern you’re missing.
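In its simplest form this is just a hardcoded state table: each stage pins the tools (and by extension the context) the agent may use, and the transitions themselves are code, not LLM decisions. The stage and tool names below are hypothetical illustrations, not our actual workflow.

```python
# Each stage pins the agent's purpose and allowed tools; the
# transitions between stages are hardcoded domain knowledge.
STAGES = {
    "triage": {"tools": ["ask_clarifying_question"], "next": "dispatch"},
    "dispatch": {"tools": ["request_vendor_quote", "schedule_visit"], "next": "close_out"},
    "close_out": {"tools": ["confirm_completion", "record_invoice"], "next": None},
}


def advance(stage: str):
    """Move the work order to the next hardcoded stage (None = done)."""
    return STAGES[stage]["next"]


def allowed_tools(stage: str) -> list:
    """Only expose the tools that make sense at this point in the workflow."""
    return STAGES[stage]["tools"]
```

Uncomfortable as it feels, the table is where the vertical-specific knowledge lives.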
Markdown Reports
I may come to see the error of my ways, but I love the idea of a semi-public system prompt that you build with the customer. In our use case there are large chunks of context which are roughly static and which differ strictly by customer. Markdown is a great way to organize the information for both people and the LLM.
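Assembling that prompt can be as simple as concatenating markdown sections. This is a sketch with hypothetical names (`build_system_prompt`, `customer_md`); the key idea is that the per-customer sections are readable and editable by the customer themselves.

```python
def build_system_prompt(base_instructions: str, customer_md: dict) -> str:
    """Compose the semi-public system prompt: static base instructions
    plus per-customer markdown sections (heading -> body)."""
    sections = [base_instructions.strip()]
    for heading, body in customer_md.items():
        sections.append(f"## {heading}\n{body.strip()}")
    return "\n\n".join(sections)
```

Because the customer-specific context is plain markdown, the same document doubles as documentation for the humans it describes.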
Examples and Anecdotes
- Tenants will submit maintenance requests primarily at night or on the weekend. They generally really appreciate getting a text right away (especially at night) and often say thank you. Another major benefit to texting them right when they submit the issue is that it greatly reduces back and forth because they are clearly free/thinking about it right then so they are more able to send pictures and additional information.
- Leasing calls have a very large inbound volume but represent a relatively low-value interaction for property managers. One property manager we worked with used to point their leasing number directly at a voicemail box because so many of the questions could have been answered by visiting their homepage. Dropped into this high-volume use case, Mason is able to speak to all of these leasing calls simultaneously. Because his communications are consistent, he can steer sharply away from legally sensitive topics like Section 8/rent-controlled housing and just send a link to a page describing the process instead.
- Mason often will take the tenant at their word and make recommendations to the property manager based on the specific words and assumptions they’ve provided him. It’s a work in progress to make him more skeptical and gently push back. Often he performs worst when someone already believes they know the source of their issue and submits a very specific request without zooming out and stating the high level problem.
- AI voices have gotten extremely realistic. We shell out for our Eleven Labs voice of choice and it’s extremely convincing. Mason will always tell a user, if asked directly, that he is an AI assistant, but this rarely happens. This means that every bit of latency, stretch of silence, or failed call feels extremely personal. Once, due to flaky infrastructure, a call failed while taking down the details of a maintenance issue. When we reached out over text, the tenant was upset: “you hung up on me earlier.”
- Mason desperately wants to have dispatched a vendor to solve your problem. For a while he was commonly jumping the gun, reporting that someone was on the way and so on. It took giving him a simple out, a tool, and a hard stop at a key point in the interaction before resuming to make sure he would be truthful.
- Early on, a plumber looking for extra work would often test the bounds of what Mason would approve, seeing if he could ask for extra jobs or pick up odd handyman tasks around the property. This was trickier to drill out over the phone, but it made obvious the importance of keeping the bounds of approval super clear.
- Chatting with the average American is beautiful in a lot of ways; people are earnest and kind. However… the setup here is often adversarial, and people sometimes blow up while being asked simple questions about their water heater. The AI takes it on the chin every time, but handles a different class of defense mechanism kind of oddly. People often bring up extremely personal things they are going through, like the sickness of a relative. These are the cases where having an AI interact with people can feel off. It’s hard to tell whether these are honest pleas for connection, but Mason will often brush past them because they are far outside his intended purpose.