Legal GenAI Chatbots Design Review


Are open-ended generative AI chatbots a good interface for high-stakes legal work?

The legal world is working on integrating chatbots into its workflows. These tools can bring significant gains in clarity and professionalism, and a legal-specific application layer can deliver superior usability. Success, however, depends on a lot of intentional design decisions.

This deep dive has three parts: a breakdown of the chatbot interface, its usability challenges, and notes on what to take into consideration when letting this technology into your processes.

For this article, I reviewed and compared selected Legal AI tools (Harvey, Legora, CoCounsel, Vincent AI) using LegalTechHub product walkthroughs as a reference, and contrasted these against selected general purpose tools (ChatGPT, Claude, Perplexity).

In my review, I focus primarily on usability, and only discuss accuracy and benchmarking as they relate to the design.

Chatbots as the interface to interact with Large Language Models

OpenAI set the tone for AI UI with the release of ChatGPT in late 2022. Most tools today follow similar layouts and interaction patterns.

Legal AI follows this general trend as well, with many solutions offering chatbots as the prevalent interface, with adjustments that reflect the specifics of legal work.

The adoption of the widely established open-ended chatbot interface in legal is not surprising. This form of interaction is already familiar to many users. It is very likely that by the time any enterprise solutions are implemented, users will have gotten their first impressions of the technology from tools like ChatGPT. 

This familiarity could remove friction in adoption and change management. Providing similar tech in-house should also steer employees away from using unverified personal accounts.

There are other interaction patterns in Legal AI tools beyond chat, such as tabular review, AI-powered interactions embedded in preexisting tools, Word add-ins, more or less agentic workflows, and others – all of which we will leave out of scope for now.

Breakdown of the Legal AI chatbot interface

The input box

The standard chatbot space is characterised by two different areas: the input area (where the user adds the prompt) and the conversation area with submitted queries and LLM answers.

The input area consists of

  1. a text box that affords natural language prompting,
  2. the send button, and
  3. a selection of prioritised parameter settings.

I call them prioritised because they were selected from a virtually infinite number of potential tweaks and selections and put in a prominent space.

In this respect, it is interesting to see which options the platforms offer in the coveted chat card:

  1. Upload a file or select a reference (Vincent, Harvey, ChatGPT, Legora)
  2. Query format or styling (Harvey, Claude)
  3. Jurisdiction Selection (Vincent, Legora)
  4. Select additional tools, such as legal research, deep research (ChatGPT, Perplexity) or web search (ChatGPT, Legora)

If I have a tool with potentially infinite use cases, the one or two things that make it to the home screen are a big deal.

The user could just specify these themselves in their prompt, but instead, the tech provider chose to nudge the user to think about them.

This could communicate to the user that certain factors are taken into consideration. For example, if the user can select a jurisdiction, they could infer that the bot has been sufficiently trained on legal sources from that jurisdiction, or that the style of drafting will correspond to the prevailing style (such as how lawyers write differently in continental and common law jurisdictions).
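To make this nudge concrete, here is a minimal sketch (in Python, with entirely hypothetical names, not any vendor's actual API) of how prioritised input-box settings could travel with the prompt as structured parameters and be folded into the instructions the model actually receives:

```python
from dataclasses import dataclass, field

# A sketch only: structured settings from the chat card, instead of relying on
# the user to spell everything out in free text.

@dataclass
class ChatRequest:
    prompt: str                          # the natural-language query
    jurisdiction: str | None = None      # e.g. "CZ", "DE", "US-NY"
    output_format: str | None = None     # e.g. "memo", "email", "bullet points"
    reference_files: list[str] = field(default_factory=list)  # uploaded document IDs
    tools: list[str] = field(default_factory=list)            # e.g. ["web_search", "deep_research"]

def build_system_prompt(request: ChatRequest) -> str:
    """Translate the structured settings into instructions the model actually sees."""
    parts = ["You are a legal research assistant."]
    if request.jurisdiction:
        parts.append(f"Answer under the law of this jurisdiction: {request.jurisdiction}.")
    if request.output_format:
        parts.append(f"Format the answer as: {request.output_format}.")
    if request.reference_files:
        parts.append("Ground the answer in the attached reference documents only.")
    return " ".join(parts)
```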

Natural Language Prompting

You command the chatbot in the same way you would normally speak (natural language prompting).

The prompt is the expression of intent: what the user indicates they want the AI to do.

The prompt is to generative AI what the mouse is to graphical user interfaces: I click on this folder, I want it open. Except now it is the user who determines what will happen, without having to follow pre-determined workflows.

There are also fewer hints on the possible actions, forcing more recall over recognition (figuring out what you want to do instead of choosing from a menu).

The models could – in theory, and with varying degrees of accuracy – generate anything. The lack of guidance is therefore a feature.

To leverage the tool properly, the user needs to be able to specify exactly what they are looking for. An outsized emphasis on prompting means it is the user’s job to figure out the good questions. At least unless the tool employs sophisticated routing and orchestration (see below).

In law, prompting is one level harder. Asking good questions requires having a pretty good grasp of the bigger picture, the regulatory regime, the terminology. You may need to know where to look, what kind of reasoning to require from the model, or what kind of context is needed to make any call.

System level: what happens before and after

While putting in a couple of words feels almost deceptively simple, there is a lot that happens under the hood before and after you type.

The tools' outputs are shaped by how the model and the application are set up, including training data, guardrails, system settings and prompts, and custom prompts. None of these are usually visible to the user, and they are often not proactively communicated.

Furthermore, many of the chatbots have sophisticated orchestration and routing layers. These are crucial for implementing complex workflows, allowing for scalable review and production while maintaining the convenience of an open-ended chatbot.

As Winston Weinberg, the founder of Harvey puts it: “Nowadays, our system does not rely that much on prompting, we have a bunch of routing, knowledge system, it just does not matter as much.” 

Defined workflows beyond the chat window form an infrastructure that tries to understand what the user is trying to do and then points them to the most efficient solution.

In other words, the design of the system behind the minimalist chatbot basically defines how useful the output is going to be.
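As a rough illustration of what such a layer does, here is a deliberately simplified sketch of intent routing, with hypothetical workflow names; real products replace the keyword matching with classifier- or LLM-based routing and far richer orchestration:

```python
# A sketch only: the query is classified and sent to a purpose-built workflow
# before any drafting happens, so the raw prompt matters less than it seems.

WORKFLOWS = {
    "contract_review": "Extract clauses, flag deviations from the playbook, summarise risk.",
    "legal_research": "Retrieve sources for the jurisdiction, then answer with citations.",
    "drafting": "Generate a first draft in the firm's house style.",
    "general_chat": "Answer directly with the base model.",
}

def classify_intent(user_prompt: str) -> str:
    """Pick a workflow; in practice this would be an LLM call or a trained classifier."""
    text = user_prompt.lower()
    if "review" in text or "clause" in text:
        return "contract_review"
    if "case law" in text or "precedent" in text:
        return "legal_research"
    if "draft" in text:
        return "drafting"
    return "general_chat"

def route(user_prompt: str) -> str:
    workflow = classify_intent(user_prompt)
    # Each workflow carries its own system prompt, retrieval step, and output template.
    return f"Routing to '{workflow}': {WORKFLOWS[workflow]}"
```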

Communicating sources

A big pain point for lawyers is verifying the accuracy of AI-generated answers.

This is also reflected in the fact that the Legal AI incumbents primarily target legal professionals (and especially law firms), where there are multiple levels of review. That means (at least in theory) much less potential for liability than generating legal advice directly for consumers.

In most of the existing tools, sources are communicated in three key ways:

  1. Embedded references to web resources or reference documents (Perplexity, ChatGPT, Legora, Harvey)
  2. Separate sources pane (Vincent AI)
  3. Confidence scoring (Legora in tabular review)

Interestingly, citations may be a lower-priority feature for both users and experts, despite the need to verify the outputs. Therefore, progressive disclosure using drill-downs or separate panes could be the best way to communicate the sources and enable verification without overwhelming the user.

Finally, some Legal AI apps also add reasoning as to why and how a certain source was used (Vincent AI, Legora). This makes sense especially when the tool is deducing information from the contents of an underlying document, or with case law in common law jurisdictions, where applicability may be subject to a more nuanced analysis.
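One way to support both progressive disclosure and per-source reasoning is to keep the answer and its citations as separate structured pieces. The sketch below is a hypothetical schema, not any product's actual format:

```python
from dataclasses import dataclass

# A sketch only: the answer reads on its own, and sources with per-source
# reasoning can be revealed in a drill-down or a separate pane.

@dataclass
class Citation:
    source_id: str        # e.g. a document ID or URL
    quoted_passage: str   # the passage the claim relies on
    reasoning: str        # why this source supports the claim

@dataclass
class Answer:
    text: str
    citations: list[Citation]

def render_collapsed(answer: Answer) -> str:
    """Show only the answer with citation markers; details stay one click away."""
    markers = "".join(f"[{i + 1}]" for i in range(len(answer.citations)))
    return f"{answer.text} {markers}"

def render_expanded(answer: Answer) -> str:
    """Full view for verification: each source plus the reasoning for using it."""
    lines = [answer.text, "", "Sources:"]
    for i, c in enumerate(answer.citations, start=1):
        lines.append(f'[{i}] {c.source_id}: "{c.quoted_passage}" ({c.reasoning})')
    return "\n".join(lines)
```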

Output

Finally, with all of the above considered, the Large Language Model produces its output in one of two key patterns:

  1. Single window: meshing the product with the commentary in one output (ChatGPT, Harvey, Legora)
  2. Multiple windows or a separate file: Separating product and commentary using a dedicated space (Claude Artifacts, ChatGPT Canvas, Vincent AI)

In general, separating the product from any additional notes can support iteration and traceability, while keeping it all together makes everything look much less complicated.

Interestingly, none of the legal AI tools explore generative UI or otherwise play with the output experience.

Key usability challenges of natural language chatbots

What do I ask for?

Prompting feels fairly accessible. You sit down and write or dictate.

This can cause a blank canvas problem – it can be hard to know where to start.

Secondly, the prompt is not only the means of interaction, but also the way we discover what the tech can do.

In designer terms, what the tech can do is referred to as affordances. In general, affordances are often indicated by signifiers, more or less subtle signs that something is possible. We know we can open a door (the door affords opening), because there is a doorknob (the signifier).

Chatbots are interfaces with virtually no signifiers and a virtually infinite number of affordances, which the user has to discover through prompting.

This can be illustrated by (i) the frenzy of lawyers discussing use cases, and (ii) the design decisions in Legal AI tools.

For example, many Legal AI apps compensate for poor prompting using various mechanisms, such as assisted prompting, rephrasing, prompt banks and examples, or revising the prompt in the background.
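To illustrate the last of those mechanisms, here is a minimal sketch of revising the prompt in the background; the rewrite template and the llm() stub are placeholders for whatever model call the application layer actually makes:

```python
# A sketch only: before the user's raw query is answered, it is expanded into a
# fuller instruction, either silently or shown to the user for confirmation.

REWRITE_TEMPLATE = (
    "Rewrite the following request from a lawyer into a precise instruction. "
    "Make the task, the jurisdiction (if mentioned), the desired format, and any "
    "missing assumptions explicit. Request:\n{raw_prompt}"
)

def llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for the actual model call")

def answer_with_rephrasing(raw_prompt: str) -> str:
    improved_prompt = llm(REWRITE_TEMPLATE.format(raw_prompt=raw_prompt))
    # Assisted prompting would surface improved_prompt to the user here,
    # instead of applying it silently in the background.
    return llm(improved_prompt)
```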

Am I actually getting the outputs that I want?

Prompting a lot does not automatically make you better at it. Getting better requires a lot of critical thinking, reviewing the outputs, not taking anything at face value, and knowing exactly what kind of output you want.

And this, too, is on the user, because there is very little feedback as to how your prompt has landed. Some users react to this with additions to their prompts ("If you need more context to answer this, ask me. Do not guess. Tell me what you are going to do first.").

This has become more interesting with reasoning models, where you get some insight into how the model processes the query, giving at least a little bit of transparency.

But that does not mean that the model would say: congrats, this was a solid B+ prompt.

Dealing with accuracy and reviewing answers

Accuracy of outputs is of paramount importance for lawyers. As noted above, there is no quality threshold a prompt must pass to be accepted. So the user can only tell whether the tech is actually good at a given activity from the output. That can be very difficult, unless one is an expert who can spot problems easily.

The whole point of these tools is that they are predictive engines. The answer will always sound kinda ok at first glance, because it is probable. That does not, however, mean that it will be legal or reflect all the relevant intricacies.

There are very few ways to tell if an output is good beyond just doing all the heavy lifting and checking everything very diligently. The tools should reflect that.

We are still working on producing evals and legal AI benchmarks, so I will leave this topic aside for another deep dive.

What does this mean for the teams sourcing this technology?

Based on all of the above, here is what you can do to make the chatbot a useful and transparent tool for your users:

  1. Be transparent about what the tools do and don't do,
  2. Communicate the system prompting, routing, etc. of your tools – in an appropriate format and in the branding of your tool,
  3. Consider the appropriate level of prompting assistance (as an interface or in training),
  4. To the extent practicable, communicate what training data has or has not been used, whether there is a knowledge database, and other available information on sources.

Optimize the experience of your user

Quick tip: conduct proper research before implementing any tools

Any of the decisions below – on branding, the form of prompt assistance, or communicating confidence – should be based on thorough underlying user research. This is very tricky when your tool is used by an enormous general audience on a daily basis, with infinite usages from kids' birthday parties to evaluating skincare.

In this respect, legal-specific application layers have an easier job. But still, there are significant differences between practices, levels of seniority, and tasks across jurisdictions, as well as between law-firm lawyers and in-house lawyers.

Understanding usage and workflows can help you create the scaffolding that lets the user achieve their desired ends more easily. That also means optimizing what the tool looks like for easy retrieval of the key information – such as providing citations or tuning prompting assistance to the most frequent use cases.

This research will also help you understand how much weight you can put on your user – whether they should be the ones doing all the prompting and specification, or whether you should design your tool in a more assistive way.

Communicate your intentions to the user

Quick tip: Make sure that the packaging of the tool tells the user what to expect

Every touchpoint tells the user what they can expect from the tool, and this starts with branding: if you call it Lex-something, you are telling the user the tool can do law, regardless of whether that is true. If the user states that their home jurisdiction is the Czech Republic, they will be angry to discover that your tool does not know anything about local laws.

This applies to every single interaction – the UI, the branding, the options you select for the prompting window, whatever you communicate in the sources, how you name your custom tools, and what you tell your user about the system prompt and the routing.

You cannot rely on your user to infer any limitations themselves from the nature of these tools. If the tool fails to meet their expectations, they will most likely just think that it is stupid or useless. Or they will blindly trust the output regardless, which is outright dangerous in certain scenarios.

The challenge is how to communicate this incredibly complex setting to the user in a way that is not completely overwhelming. In this context, even small decisions matter a lot: the name of a tool with a specific system prompt, the selection of settings for the input area, or the form of prompting assistance provided (see below).

If your tool does extra work (such as via workflows in the orchestration layer), the user should know how it influences the outputs.

Prompting assistance and rephrasing

Quick tip: If prompting is the key to get good outputs, make it easier for the user to do it well.

Prompting can be difficult, but there are multiple levels of assistance that you can provide to the user:

  • In the design of the tool: orchestration and routing layers that match the user with the output they need, even if their input is not exactly awesome
  • At the moment of need: prompt banks (Harvey, Legora, Vincent AI, CoCounsel) or rephrasing assistance (Vincent AI)
  • General: training sessions and introductions to the tool, sharing of use cases, and broader debate about how we use the tools across organizations.

Given the accessibility of the tools, you will always have both power users and first-time users in your group. The tool should be reasonably accessible to beginners, supported by scaffolding, while still allowing power users to squeeze everything out of the open-ended chatbot window.
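As a concrete illustration of the "moment of need" assistance mentioned above, a prompt bank can be as simple as a set of curated, parameterised templates. The templates below are invented for illustration, not taken from any of the products mentioned:

```python
from string import Template

# A sketch only: curated prompts that a first-time user can fill in instead of
# starting from a blank canvas.

PROMPT_BANK = {
    "summarise_contract": Template(
        "Summarise the attached $contract_type, listing the parties, term, "
        "termination rights, and any unusual clauses."
    ),
    "compare_clauses": Template(
        "Compare the $clause_name clause in the attached document against our "
        "standard position and flag deviations."
    ),
}

def fill(template_key: str, **params: str) -> str:
    """Fill a template from the bank with the user's parameters."""
    return PROMPT_BANK[template_key].substitute(**params)

# Example: fill("summarise_contract", contract_type="share purchase agreement")
```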

Reasoning, explainability, and sourcing

Quick tip: make it easy for the user to verify the outputs

In cases where there is a correct (or lawful) answer, the user (and especially any legal user) needs to be able to verify the output, and you should optimize for that.

This can be achieved by purely manual review, or supported by sources or other forms of assistance.

There are multiple ways to streamline this process at the design level, beyond a simple disclaimer telling users that they should verify.

Is the chatbot the ultimate Legal AI design interface?

It depends.

Chatbots can truly be the optimal way to interact with Large Language Models, but only if the branding, design, and technical factors are aligned to optimize the user experience.

There is just so much that goes into the setup of both the general and legal-specific tools that there is no definitive answer.

Many users will consider Generative AI chatbots as the magic window that can generate absolutely anything for them. This poses its unique challenges on both the input and the output side.

  • On input, one way or another, someone needs to be asking those good questions to guide the model. And it can be either the user themselves or the application-layer provider. Each option has its tradeoffs that can be reflected in the interface.
  • On output, it is about how much and how easily the user can iterate and determine whether the result is correct, legal, and corresponds to reality.

And it is important to note that sometimes no AI is the best solution, especially if we are seeking a very consistent, super controlled output. If speed, determinism, or structure is essential, other tools (with or without AI) can be a better fit.

Final Provisions

Working with AI chatbots looks simple on the surface, but there is so much that goes into the design of simple things.

When implementing legal AI chatbots, design decisions should strike a balance between usability and complexity.

Finally, there is still concentrated change management and communication work to be done.

Where do you stand on this?

Is the chatbot the ultimate Generative AI interface?

By Baru

A blogger and teacher from Big Law with a proclivity for computer science and good design.


