
May 6, 2024
Decision Nodes

On Building with AI

Yoel Tadmor, Head of Engineering @ Eraser

Introduction

In a few short years, AI has taken the world of software by storm. Every new model features improved intelligence, better performance, and new, seemingly magical, capabilities. For app builders, it has evoked a mix of kid-in-a-candy-shop giddiness and keeping-up-with-the-Joneses anxiety.

At Eraser, we started our journey with AI a year ago, building DiagramGPT, and eventually incorporating multiple AI features into our core app. Today, we’ll look at several decisions we made along the way, addressing some in a more philosophical or strategic manner, and some more tactically:

  • How we decided what to build (and, equally, what not to)
  • The risks of adding AI to existing features and how we chose to mitigate them
  • How we approach quality
  • How our AI diagramming and outlining work

Who we are

To provide color and context, a brief introduction is needed:

Eraser is a docs and diagrams tool for engineering teams. Our conviction: most of us want to get better at making decisions collaboratively and documenting our work, but don't because:

  • Switching from coding to writing or diagramming feels like going from flow and creativity to anxiously staring at a blank screen on an unfamiliar tool.
  • Coding provides near-instant feedback, often subject only to compilation time. Writing can require working alone for days with no help or sense of whether we're on the right track.
  • Even when we do our chores, the final product often isn't something we're as proud of as our "real work".
  • Because we use a mish-mash of tools and only sporadically document things, it's never clear what's where, what's accurate, and what's relevant.

We've made a lot of progress solving these problems, and in due time, we aim to solve them all.

AI strategy: What to build

To answer this question, let's start with a simple model of AI features:

[Diagram: a simple model of AI features covering inputs, model interaction, and outputs]

Focusing on any of those might allow us to build a uniquely valuable and magical feature:

  • Inputs: using existing form factors, data, or assets, we can create inputs that would be impossible (or impractical) to wrangle on one's own.
    • Example: A Slack-like chat app could allow users to summarize an entire thread or the last week's worth of posts in a particular channel.
  • Model interaction: using existing data, train or tune a superior model, or leverage expertise and experimentation to engineer optimized prompts.
    • Example: A customer-support chat app can fine-tune a model on existing chats.
  • Outputs: taking generally available inputs, we can ask AI to produce outputs that are uniquely valuable within our application.
    • Example: A canvas app can produce interesting visual outputs.

Ideally, we would combine all three!

Because AI models are so powerful, it's easy to feel drawn in many directions. As a small team, we found value in articulating what we didn't want to build (for now, at least):

  • Features that would be very impressive when they worked, but would rarely do so.
  • Features that would overreach and do too much, taking more time and effort to sort through than would otherwise be saved.
  • Many small features that provided widely available functionality (e.g. "change tone", "turn into bullets").

Our choice: AI diagramming

Through that lens, focusing on novel outputs was the clear choice for us. We wanted something that would work well out of the box, even for new teams that weren't already using Eraser as a system of record. We had already spent a lot of time on our diagram-as-code feature, which converts our in-house DSL into beautiful rendered diagrams. Using AI to generate our DSL checked many boxes:

  • Proven value: Diagram-as-code was already one of our most popular and differentiating features. We had conviction that time spent improving the underlying form factor was time well spent, even if the AI feature ended up going nowhere.
  • GPT friendly: Many a useful diagram can be described in a modest amount of natural language. SQL schemas, code snippets and infrastructure configuration, while arcane to many, are lexical artifacts that GPTs are optimized to understand.
  • Philosophical alignment: When you're at a whiteboard, one thing you don't have to do is think about how to draw a square. We want Eraser to embody that - to let you focus on the ideas being communicated, and not the mechanics of drawing. Like a well-crafted abstraction, we want to hide the tedious bits while giving you full rein over what's meaningful. It's why we built diagram-as-code. Building AI into that felt like a natural extension.
  • A touch of magic: Dropping a few sentences or chunks of SQL into a textbox and seeing a beautiful diagram come out still feels wondrous.

How to ship


Our goal in the first phase was to spend 1-2 days to prove out the AI diagramming use case. We:

  • Used OpenAI's playground
  • Tried 10-20 examples that were designed to be non-trivial but straightforward
  • Did minimal prompt engineering

The results were even better than we had anticipated. Excited to build, we faced our next decision: to modify our existing feature or build a new experience.

[Diagram: modify the existing feature vs. build a new experience]

Building AI into our existing app had some attractive upside:

  • Immediate benefit: If it worked well, our app users could start running with it on day one.
  • Head start on in-app iteration: We knew the eventual goal was a single holistic experience. Starting inside of the app might help us iterate towards that.
  • Single deployment: For a small team, it can be a major distraction trying to build, deploy, and support a truly separate app.
  • Code sharing: Relatedly, re-packaging and re-using our complex diagram-as-code parsing and rendering pipeline would be a substantial project on its own.

We also identified several risks:

  • First impressions: Hallucinations, inconsistent quality, or plain old bugs might cast a bad light on diagram-as-code as a whole. The more separation, the easier it is to truly treat it as a beta.
  • Noise and complexity: Building a UI to accommodate both AI and our syntax editor would necessarily require compromising on simplicity. If the AI didn't work well, this would just end up making a good feature worse.
  • Local maxima: The AI might work just well enough to discourage people from using and learning the syntax editor, even when it would be easier and faster.

Ultimately, we chose a middle ground - we built a sidecar experience hosted on our website that still leveraged our existing code and CI/CD pipeline:

[Diagram: sidecar experience on our website, sharing the existing code and CI/CD pipeline]

This isn't perfect - we pay a price on load times and it isn't ideal for SEO purposes. But it let us move really fast (the whole site took only a couple weeks to stand up). It also gave us a long time to talk to users and look at analytics to inform our eventual approach to AI diagramming inside of Eraser.

How it works

Functional Overview

Our AI flow has three phases and a simple architecture:

[Diagrams: the three-phase flow and the supporting architecture]

The process starts with some user input - some combination of text and images.

The initial analysis has two goals:

  1. Classify the input - this can determine which follow-up prompts to feed to the AI.
  2. Create a list of assumptions and questions - these can be fed back to the user to seek clarifications.

The results of this phase, combined with the initial input, are then fed into the generation prompt. The goal of this phase is to generate structured output.

This output is then processed and fed into the UI for rendering.
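
To make the flow concrete, here's a minimal sketch of how the three phases could be wired together. The prompt strings, type names, and helpers (callModel, parseDsl, renderDiagram) are illustrative placeholders under assumed names, not our production code:

```typescript
// Sketch of the three-phase flow. Names and prompts are illustrative.
type Analysis = {
  diagramType: "cloud-architecture" | "sequence" | "entity-relationship" | "flowchart" | "unknown";
  assumptions: string[];
  questions: string[];
};

// Placeholders for the prompts and the existing diagram-as-code pipeline.
const ANALYSIS_PROMPT = "Classify the input and list your assumptions and questions as JSON.";
const GENERATION_PROMPTS: Record<Analysis["diagramType"], string> = {
  "cloud-architecture": "...",
  sequence: "...",
  "entity-relationship": "...",
  flowchart: "...",
  unknown: "...",
};
declare function callModel(prompt: string, input: string): Promise<string>;
declare function parseDsl(dsl: string): unknown;
declare function renderDiagram(ast: unknown): void;

// Phase 1: classify the input and surface assumptions and questions.
async function analyze(userInput: string): Promise<Analysis> {
  return JSON.parse(await callModel(ANALYSIS_PROMPT, userInput)) as Analysis;
}

// Phase 2: generate structured output (the DSL), using a prompt
// specialized for the detected diagram type.
async function generate(userInput: string, analysis: Analysis): Promise<string> {
  const prompt = GENERATION_PROMPTS[analysis.diagramType];
  const context = `${userInput}\n\nAssumptions:\n${analysis.assumptions.join("\n")}`;
  return callModel(prompt, context);
}

// Phase 3: post-process the output and hand it to the renderer.
async function runDiagramFlow(userInput: string) {
  const analysis = await analyze(userInput);
  const dsl = await generate(userInput, analysis);
  renderDiagram(parseDsl(dsl));
  return { dsl, questions: analysis.questions }; // questions can be surfaced back to the user
}
```

The key point is that the analysis output shapes both the generation prompt and the clarifying questions we can show the user.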

Ensuring Quality

Historically, building quality software has fundamentally boiled down to two things:

  1. Having a comprehensive understanding of possible inputs and states and how the program should act in each case.
  2. Implementing testing procedures (unit tests, E2E tools, regular manual testing, etc.) to ensure the program works to spec and doesn't regress as new features are added.

This section will focus on #1 (#2 is interesting, and we are watching the development of tools and best practices in that space).

AI-based features are very often built around open-ended user inputs, and results can be unpredictable. There is no finite set of test cases you can plan around, and even for archetypical examples, it isn't always clear what "right" or "wrong" behavior is.

Our approach to quality has been the following, in rough order of priority:

  • Do what we can on the model prompting side to maximize result quality. This has several aspects:
    1. Increase the likelihood of strong results from meaningful inputs.
    2. Guide the AI away from producing nonsense results when faced with mediocre inputs.
    3. Experiment enough to detect bad habits and try to guide the AI away from them.
  • Guide the user to provide meaningful inputs and allow iteration. This can be done with archetypical examples, structured forms, and input length meters.
  • Define the various failure modes and provide clear feedback when they are reached.
  • Allow rollbacks when AI edits existing work. (A sketch of these last two points follows this list.)
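
To illustrate those last two points, here's a sketch of how failure modes and rollbacks might be modeled. The specific modes, messages, and function names are hypothetical:

```typescript
// Sketch of failure-mode handling and rollback. Modes and messages are illustrative.
type GenerationResult =
  | { kind: "ok"; dsl: string }
  | { kind: "needs-clarification"; questions: string[] } // ambiguous input
  | { kind: "unsupported"; reason: string }              // "I can't diagram a rhinoceros"
  | { kind: "too-sparse"; reason: string }               // two words, asked for an essay
  | { kind: "error"; message: string };                  // provider or network failure

// Each failure mode maps to clear, user-facing feedback.
function feedbackFor(result: GenerationResult): string {
  switch (result.kind) {
    case "ok": return "Diagram generated.";
    case "needs-clarification": return `A few questions first: ${result.questions.join(" ")}`;
    case "unsupported": return `This doesn't look like something we can diagram: ${result.reason}`;
    case "too-sparse": return "Try adding more detail about the components and how they connect.";
    case "error": return "Something went wrong on our end. Please try again.";
  }
}

// When AI edits existing work, snapshot the document first so the user can roll back.
function applyWithRollback(doc: { content: string }, result: GenerationResult): () => void {
  const snapshot = doc.content;
  if (result.kind === "ok") doc.content = result.dsl;
  return () => { doc.content = snapshot; }; // call to undo the AI edit
}
```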

Notes on model selection

The easiest way to improve the quality of output is to use a best-in-class model (GPT-4 Turbo, Claude Opus, etc.). Thankfully, creating diagrams and starting documents share several characteristics that allow us to use these models:

  • Low frequency: One or two successful generations per day is sufficient for the feature to have impact.
  • High value: If the feature works well, it can replace an hour or more of work.
  • Low latency sensitivity: While faster is always better, a diagram doesn't need to be drawn within a certain time frame to be useful.

Consequently, we can justify using these best-in-class models. For other features, such as as-you-type suggestions, form auto-fills, or smaller edits, that might not be the case.

With that said, there is an important nuance. Recall that a single diagram generation might require several passes with the model (as is true of many interesting AI applications). While the primary generation phase absolutely benefits from a more powerful model, some of the analysis can be performed reasonably with lower-quality models such as GPT-3.5. At a high level, that process works something like:

[Diagram: analysis on a cheaper model feeding a focused generation prompt]

This continues to be an area of experimentation for us, but we've found a few useful heuristics and techniques for working with a weaker model:

  • Fixed-option classification: While errors and hallucinations are still possible, they are far less likely if the output is limited in scope.  
  • Provide an opt-out: As with people, if simply provided with several options, an AI can feel pressured to pick one. We've found it helps to offer an "I don't know" type of option and to explicitly ask it to pick an option only if confidence is very high. The opt-out can then route to a general-purpose prompt.
  • User overrides: As with other quality aspects, it can make more sense to use lower-quality models on aspects of the process that the user can override by specifying the result or providing some specific input.

An interesting note here is that adding an analysis phase, when done with a cheaper model, can actually end up saving both time and token cost if it allows the generation step to use a smaller, more focused prompt.
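
Here's a rough sketch of what fixed-option classification with an opt-out and a user override might look like on a cheaper model. The option names, prompt wording, and callCheapModel helper are assumptions for illustration:

```typescript
// Sketch of fixed-option classification with an opt-out, run on a cheaper model.
const DIAGRAM_TYPES = ["cloud-architecture", "sequence", "entity-relationship", "flowchart"] as const;
type DiagramType = (typeof DIAGRAM_TYPES)[number];

const CLASSIFY_PROMPT = `Classify the user's input as one of: ${DIAGRAM_TYPES.join(", ")}.
Only pick an option if you are very confident; otherwise answer exactly "unknown".
Answer with a single word.`;

declare function callCheapModel(prompt: string, input: string): Promise<string>;

async function classify(
  userInput: string,
  userOverride?: DiagramType // a user-specified type skips the model call entirely
): Promise<DiagramType | "unknown"> {
  if (userOverride) return userOverride;
  const answer = (await callCheapModel(CLASSIFY_PROMPT, userInput)).trim().toLowerCase();
  // Constrain the output: anything outside the fixed options falls back to "unknown".
  return (DIAGRAM_TYPES as readonly string[]).includes(answer)
    ? (answer as DiagramType)
    : "unknown";
}
```

Anything that comes back as "unknown" can fall through to a general-purpose generation prompt, while a user override bypasses the model call altogether.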

Notes on streaming

In our view, streaming results from the AI helps make them feel more natural - like a colleague rather than a black box. It's part of what made ChatGPT so successful, and we wanted to incorporate it as much as possible.

But streaming also comes with its own set of challenges. While models have gotten better at returning structured JSON when requested, we've found that asking for other formats can be easier than trying to handle partial JSON responses.
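
As an example of why, here's a sketch of consuming a streamed, line-oriented format: every completed line can be handed to the renderer immediately, with no need to repair partial JSON. The streamModel helper is a stand-in for whichever provider streaming API is in use:

```typescript
// Sketch of consuming a streamed, line-oriented format.
// streamModel is a placeholder for a provider's streaming API.
declare function streamModel(prompt: string, input: string): AsyncIterable<string>;

async function streamDiagram(
  prompt: string,
  input: string,
  onLine: (line: string) => void // e.g. append to the DSL editor and re-render incrementally
): Promise<void> {
  let buffer = "";
  for await (const chunk of streamModel(prompt, input)) {
    buffer += chunk;
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep the partial tail for the next chunk
    for (const line of lines) {
      if (line.trim()) onLine(line); // every complete line is usable immediately
    }
  }
  if (buffer.trim()) onLine(buffer); // flush the final partial line
}
```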

The devil is in the details

While Eraser's DSL was created to be as simple as possible, we still offer a lot of optional or advanced functionality, including:

  • Icons
  • Colors
  • Different types of connections
  • Connection labels

Each of these adds value when done well, but each also requires a certain amount of space within the prompt. Much of our iteration on the prompt engineering side went into experimenting with these details and seeing which ones the AI could consistently get right.

At the same time, each detail does come at a cost - the more that goes into a prompt, the more likely it is to confuse the AI and miss some of the larger picture.
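
One way to manage that trade-off is to assemble the generation prompt from optional sections, so each detail can be toggled on or off while experimenting. The instruction text below is illustrative, not our actual prompt:

```typescript
// Sketch of composing the generation prompt from optional sections.
type PromptFeatures = {
  icons: boolean;
  colors: boolean;
  connectionTypes: boolean;
  connectionLabels: boolean;
};

function buildGenerationPrompt(base: string, features: PromptFeatures): string {
  const sections: string[] = [base];
  if (features.icons) sections.push("Add an icon to each node where one clearly applies.");
  if (features.colors) sections.push("Use color groups sparingly to distinguish logical tiers.");
  if (features.connectionTypes) sections.push("Use dashed connections for asynchronous calls.");
  if (features.connectionLabels) sections.push("Label connections with the protocol or action when obvious.");
  return sections.join("\n\n");
}
```

Dropping a detail becomes a one-line change, which makes it easy to test which ones the model can handle consistently.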

TL;DR

Where to start

First things first - it's important to have a good framework for prioritizing which features to work on. One downside of AI's multi-modality is that it can do so much - but building a good feature requires a lot of focus and attention to detail.

How to ship

Feeling ambitious? It can be difficult to manage the risk of building a moon-shot feature into your primary flow. Consider all options here - we ended up building a new landing page for our initial deployment.

Thinking in conversations

To build an experience that offers unique value, it's often important to create specialized prompts and structured inputs. Consider a multi-phase process that includes initial AI-powered analysis and classification.

Getting quality results

We use state-of-the-art models by default, because our approach is to build low-frequency, high-value features. Even so, there's room for faster and cheaper models to do some of the up-front analysis. We're still experimenting with approaches here, more to come!

Beyond the model and prompt engineering, crafting a UX that guides users to provide high-quality inputs is essential. We've combined a few techniques here:

  • Examples that show off archetypical use cases
  • Structured forms
  • Input length meters

A bevy of states

With AI, best-case scenarios feel electric. But thinking through the entire spectrum of results - from "I can't diagram a rhinoceros" to "you just asked me to generate an essay out of two words" to "I'm not sure I understand" - is what separates great from so-so.