Taking AI Apps from Prototype to Production

Running complex inference on state-of-the-art AI models is now just a few lines of code:

import openai

# GPT-4 is a chat model, so we call the chat completions endpoint (legacy, pre-1.0 client style)
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "what is the meaning of life?"}],
)

Nothing stops developers from exploring possibilities. Most managed LLM providers offer similar abstractions (Claude, Cohere, and hosted LLaMA, to name a few). The question is: why haven't enterprises adopted them en masse?

We don't mean simple chatbots whose output can change with the time of day. We're talking about deep integrations of LLMs into your applications.

After interviewing hundreds of engineering leaders spanning startups to enterprises, we've boiled down the patterns to three concerns:

  • Accuracy: Often discussed as "hallucinations", this means the model's results aren't grounded in reality. The impact ranges from the trivial, like telling the wrong time, to the critical, like reporting the wrong bank balance. Plenty of options can reduce these hallucinations: developers can tune the model's hyperparameters, introduce Retrieval-Augmented Generation (RAG) techniques, and apply Reinforcement Learning from Human Feedback (RLHF).
  • Latency: Calling some of the beefier LLMs, whether managed or self-hosted, introduces significant latency. This is often due to network egress/ingress, hardware resources, and simply the size of the models. However, we haven't seen any managed models get round-trip latency below 800ms; for context, standard APIs respond in 50-300ms. ML engineers can always quantize the models, which has its own trade-offs.
  • Cost: GPT-4's pricing is $0.03 per 1K input tokens and $0.06 per 1K output tokens. Using the tiktoken library, we calculated this article at 1,548 tokens, which would cost about $0.05 to send as input (see the sketch just after this list). That doesn't seem like much, but imagine running 1,000 iterations; after all, models are always evolving and inference is never one-and-done. Usage-based pricing such as charging per token has its own trade-offs. We've seen teams use models such as LLaMA in their ETL pipelines, which gets expensive fast.

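To make that per-token math concrete, here's a minimal sketch of the estimate above using the tiktoken library; the file name and price constant are assumptions you would swap for your own.

import tiktoken

# GPT-4 input pricing at the time of writing (USD per 1K tokens)
INPUT_PRICE_PER_1K = 0.03

def estimate_input_cost(text: str, model: str = "gpt-4") -> float:
    """Count tokens with the model's encoding and estimate the input cost."""
    encoding = tiktoken.encoding_for_model(model)
    num_tokens = len(encoding.encode(text))
    return num_tokens / 1000 * INPUT_PRICE_PER_1K

article = open("article.txt").read()  # hypothetical file containing this post
print(f"~${estimate_input_cost(article):.2f} to send this article as input")
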
Other concerns such as privacy, learning curve, explainability, and maintainability are top of mind but rank below accuracy, latency, and cost.

The core answer to all of these concerns is a separation of duties: task-specific models working together to achieve a larger task. This can be referred to as model daisy chaining, ensemble learning, or staged analysis.

Take a facial recognition example, expanded here:

  • Model One: Given a picture, generate bounding boxes of all the faces in the frame
  • Model Two: Measure the features within each bounding box, for example the distances between eyes, nose, and mouth
  • Model Three: Generate embeddings from each set of feature distances

To reduce errors and latency, each model is trained, deployed, and optimized independently. Model one can have hyperparameters A & B for use case X, while model two can have hyperparameters D & E for use case Y. Combine those settings across the context of the entire chain and it quickly becomes combinatorially explosive... 🤯
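
As a rough illustration of that kind of staged pipeline, here's a minimal sketch; detector, measurer, and embedder are hypothetical stand-ins for the three independently trained and tuned models.

from dataclasses import dataclass
from typing import Callable

@dataclass
class FacePipeline:
    detector: Callable   # Model one: image -> list of face bounding boxes
    measurer: Callable   # Model two: cropped face -> feature distances
    embedder: Callable   # Model three: feature distances -> embedding vector

    def run(self, image):
        embeddings = []
        for box in self.detector(image):
            crop = image.crop(box)               # assumes a PIL-style image
            distances = self.measurer(crop)
            embeddings.append(self.embedder(distances))
        return embeddings

Each stage can be trained and tuned on its own, but evaluating every combination of their settings against the full chain is where the explosion comes from.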

Prediction: To optimize for accuracy, latency, and cost, engineers will increasingly combine models with data queries, then glue them together with code.

NUX enables developers to safely experiment with the evolving state of data, models, and code in a familiar playground interface that, once "ready", maps to infrastructure as code that can be deployed anywhere. We handle the journey from prototype to production with ease 😎.

Explosion of Options

There's a myriad of models, data sources, and code to choose from, which creates analysis paralysis for engineering orgs and slows their "time to value". Skunkworks projects and hackathons can give developers the freedom to evaluate each of these for their use case.

💡
We have engineers dedicated to helping teams evaluate combinations of models, data and code for their use case, completely free. Schedule a Call

Data

Vector databases, transactional databases, and text indexes often get paired with 3rd-party APIs, and all have their own re-ranking solutions. Different embedding models, similarity algorithms (cosine, euclidean, dot product), metadata filtering, and top_k values all yield their own results.
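
To see how much the similarity metric and top_k alone can shift results, here's a minimal numpy sketch; the random embeddings are stand-ins for whatever your embedding model produces.

import numpy as np

def top_k(query, corpus, k=3, metric="cosine"):
    """Return indices of the k nearest corpus vectors under the chosen metric."""
    if metric == "cosine":
        q = query / np.linalg.norm(query)
        c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
        scores = c @ q
    elif metric == "dot":
        scores = corpus @ query
    else:  # euclidean: smaller distance is better, so negate it
        scores = -np.linalg.norm(corpus - query, axis=1)
    return np.argsort(scores)[::-1][:k]

corpus = np.random.rand(100, 384)   # stand-in for 384-dim sentence embeddings
query = np.random.rand(384)
print(top_k(query, corpus, k=5, metric="cosine"))
print(top_k(query, corpus, k=5, metric="euclidean"))  # often a different ranking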

We maintain an open source vector database comparison/evaluation repo: http://vectorsearch.dev/
Name        Open Source   Index Type
Pinecone    No            Vector
MongoDB     Yes           Hybrid (Dense, Sparse, NoSQL)
Weaviate    Yes           Hybrid (Dense and Sparse)
Google      No            Vector
Elastic     Yes           Hybrid (Dense and Sparse)
Algolia     No            Text
Vespa       Yes           Hybrid
Milvus      Yes           Vector
Redis       Yes           Transactional (Dense, Sparse and Key Value)
Qdrant      Yes           Vector
OpenSearch  Yes           Hybrid (Dense and Sparse)
LucidWorks  No            Text

Models

Hugging Face now has over 1 million models in its inference catalogue. There are over 20 managed LLMs, each with their own trade-offs. Different hyperparameter options (temperature, top_p, etc.) and base models (7B vs. 30B parameters) all yield completely different results.
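
As a quick illustration of how the sampling hyperparameters alone change what comes back, here's a minimal sketch using the same legacy (pre-1.0) openai client as the snippet above; the prompt is a placeholder.

import openai

prompt = "Summarize our refund policy in one sentence."  # placeholder prompt

for temperature in (0.0, 0.7, 1.2):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,  # higher = more diverse, less repeatable
        top_p=1.0,                # nucleus sampling cutoff
    )
    print(temperature, response["choices"][0]["message"]["content"])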

Name     Open Source   Type
GPT-4    No            Sentence Completion, Text Generation
LaMDA 2  No            Conversational AI
Claude   No            Sentence Completion, Text Generation
Bard     No            Text Analysis, Generation
Cohere   No            Natural Language Understanding, Generation
LLaMA2   Yes           Text and Code Generation

Code

Pre-processing libraries can be overwhelming. Langchain, for example, has over 50 document loaders, each with their own chunking parameters that vastly change the output quality of models.
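
For instance, here's a minimal sketch of how chunking parameters alone alter what a downstream model sees, using Langchain's recursive character splitter; the chunk sizes and source file are arbitrary assumptions.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = open("handbook.txt").read()  # hypothetical source document

for chunk_size in (200, 1000):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,  # max characters per chunk
        chunk_overlap=50,       # characters shared between adjacent chunks
    )
    chunks = splitter.split_text(text)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks")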

Name          Open Source   Type
LlamaIndex    Yes           Data Indexing
Langchain     Yes           Text Processing
Unstructured  Yes           Data Extraction
Haystack      Yes           Information Retrieval

On the path to production, engineers must combine each of these primitives (e.g. Pinecone, GPT-4, and Langchain), chain them together, and experiment with the parameters of each to see how they perform against acceptance criteria.
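
A minimal sketch of that kind of chain, with a hypothetical vector index standing in for Pinecone and the same legacy openai client as above:

import openai

def answer(question: str, index, top_k: int = 5) -> str:
    """Retrieve context from a vector index (hypothetical API), then generate."""
    docs = index.query(question, top_k=top_k)        # placeholder retrieval call
    context = "\n".join(d["text"] for d in docs)
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response["choices"][0]["message"]["content"]

Every knob here (the index, the model, top_k, the prompt) is a candidate for the experimentation described next.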

Rapid Prototyping

  • Objectives: This involves conducting market research, talking to users, defining scope, and more.
  • Consolidation: All technology options need to be vetted against security and compliance concerns, then be quickly accessible to developers. Ensuring solutions are "air-gapped" is often key. Middleware tooling can accelerate this.
  • Learning Curves: Some tools such as Langchain are notoriously more challenging to work with than LlamaIndex, but expose more functionality. Workshops and documentation can accelerate this.
  • Acceptance Criteria: It's helpful to define what the output of a happy-path pipeline should be. That lets optimization frameworks automate brute-force hyperparameter searches, as sketched after this list.
  • Cost Optimizations: Develop a budget that accounts for tokens, compute and operational management. Sizing hardware and projecting spend can involve asking questions and defining assumptions.
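
As a minimal sketch of what automating that looks like, the loop below brute-forces two hypothetical pipeline parameters against a handful of expected answers; run_pipeline and the test cases are placeholders for your own chain and acceptance criteria.

from itertools import product

test_cases = [  # hypothetical happy-path acceptance criteria
    ("What is our refund window?", "30 days"),
    ("Who approves expenses over $5k?", "the CFO"),
]

def run_pipeline(question: str, chunk_size: int, top_k: int) -> str:
    """Placeholder for your retrieval + generation chain."""
    ...

best = None
for chunk_size, top_k in product((200, 500, 1000), (3, 5, 10)):
    passed = sum(
        expected.lower() in (run_pipeline(q, chunk_size, top_k) or "").lower()
        for q, expected in test_cases
    )
    if best is None or passed > best[0]:
        best = (passed, chunk_size, top_k)

print(f"Best config: chunk_size={best[1]}, top_k={best[2]} ({best[0]}/{len(test_cases)} passed)")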

Path to Production

  • Independent Scaling: Scaling each block within the chain independently can reduce overall capex and increase performance. Every stage is deployed as a container with defined inputs, outputs, and resources (see the sketch after this list). Why should a tiny spellcheck model live on the same server as your beefy LLM? This also enables load balancing to distribute traffic and compute evenly.
  • Inclusive Network: Keeping all of the systems coupled within the same network not only reduces privacy concerns, it also drastically reduces latency. Do you trust these models with your code? How do you know it won't be used for other purposes?
  • Hosting: GPUs, CPUs, RAM/disk allocation, and more are all options to consider, along with how to host LLMs for inference and how those choices impact your cost structure.
  • Monitoring: Track system performance, model accuracy, and user engagement.
  • Error Handling: Alerting mechanisms to notify the team of errors or performance degradation.
  • Integration: Creating a seamless API layer for integration, plus endpoints to keep the system up to date (webhooks/triggers and streaming pipelines).
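
As one way to picture "every stage is a container with defined inputs and outputs", here's a minimal FastAPI sketch of a single stage; the spellcheck model behind it is a hypothetical stand-in.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SpellcheckRequest(BaseModel):
    text: str

class SpellcheckResponse(BaseModel):
    corrected: str

@app.post("/spellcheck", response_model=SpellcheckResponse)
def spellcheck(req: SpellcheckRequest) -> SpellcheckResponse:
    # Placeholder: call your tiny spellcheck model here. This stage can be
    # containerized and scaled independently of the heavier LLM stages.
    return SpellcheckResponse(corrected=req.text)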

Inference

If you want to call your LLMs and ML models, you need to host them for inference. Here are some options:

Name           Open Source   Type
SageMaker      No            ML Model Building and Deployment
Nvidia Triton  Yes           AI Model Serving Framework
Ray            Yes           Distributed Computing for AI
HF Inference   No            Model Hosting and Inference
ClearML        Yes           ML Ops and Versioning
BentoML        Yes           Model Serving and Deployment
RunPod         No            GPU Cloud for AI and ML
BaseTen        No            ML Model Deployment
Modal          No            ML Model Development and Deployment
Banana         No            ML Model Monitoring
Cerebrium      No            AI-based Analysis and Prediction
Lepton         Yes           Distributed Deep Learning
Fireworks      No            ML Model Scaling and Management
Octoml         No            ML Model Optimization and Deployment

Evolving

  • Feedback Loops: Continuously monitor and adjust model parameters, and automate acceptance-criteria testing.
  • Reinforcement Learning: Collect user feedback and use it to inform model retraining and updates. Then schedule batch updates.

Is this exciting? Overwhelming? NUX stands for New User Experience, and we're building for this transition from playground to production with two offerings:

  1. Playground: A familiar, Jupyter-like workbook interface to combine organization-vetted models, data connections, and coding packages.
  2. Production: Export your workbook into Infrastructure as Code, which translates into a container orchestration system that can be deployed anywhere, on-premises or in your cloud's VPC.

What will you build?

Explore workbook templates or customize your own.

Start Building