Taking AI Apps from Prototype to Production

Running complex inference on state-of-the-art AI models is now reduced to just three lines of code:

response = openai.Completion.create(
    engine="gpt-4",
    prompt="what is the meaning of life?"
)

Nothing stops developers from exploring possibilities. All managed LLMs even offer similar abstractions (Claude, Cohere, and LLaMA to name a few). The question is: why haven't enterprises adopted these en mass?

We don't mean simple chatbots where the output can change by time of day. We're talking about deep integrations of LLMs into your applications.

After interviewing hundreds of engineering leaders spanning startups to enterprises, we've boiled down the patterns to three concerns:

Accuracy: Synonymous with "hallucinations", it simply means the model's results aren't grounded in reality. This can be as insignificant as telling the wrong time, to as critical as telling the wrong bank balance. Tons of options can reduce these hallucinations. Developers can tune the model's hyperparameters, introduce Retrieval Augmented Generation (RAG) techniques and implement Reinforcement Learning with Human Feedback (RLHF).
Latency: Calling some of the beefier LLMs, whether they be managed or self-hosted, introduces significant latency. This is often due to network egress/ingress, hardware resources, and simply the size of the models. We haven't however seen any managed models get round-trip latency below 800ms. For context, standard APIs are 50-300ms. ML engineers can always quantize the models, which has it's own tradeoffs.
Cost: GPT-4's pricing structure is $0.03 / 1K input tokens and $0.06 / 1K output tokens. Using the tiktoken library, we calculated this article as containing 1548 tokens, which would cost $.05. This doesn't seem like much, but imagine creating 1000 iterations, after-all models are always evolving and inference is never one-and-done. Usage-based pricing such as charging per token has it's tradeoffs. We've seen teams use models such as LLaMA in their ETL pipelines, which gets expensive fast.

Other concerns such as privacy, learning-curve, explainability and maintainability are top of mind but fall below accuracy, latency and cost.

The core answer to all of these concerns is to create a Separation of Duties. Said differently, task-specific models working together to achieve a larger task. This can be referred to as model daisy chaining, ensemble learning or staged analysis.

Take a facial recognition example, expanded here

Model One: Given a picture, generate bounding boxes of all the faces in the frame
Model Two: Measure the features in the contents of each bounding box. For example, the distance between eyes, nose and mouth
Model Three: Generate embeddings of each feature distances

To reduce errors and latency, each model is trained, deployed and optimized independently. Model one can have hyperparameters A & B for use case X, while Model two can have hyperparameters D & E for use case Y. Then combine each of these combinations within the context of the entire chain, it quickly becomes combinatorially explosive... 🤯

Prediction: To optimize for accuracy, latency and cost, engineers will increasingly combine models with data queries then glue them together with code.

NUX enables developers to safely experiment with the evolving state of data, models, and code in a familiar playground interface that, once "ready", maps to infrastructure as code that can be deployed anywhere. We handle handle the journey from prototype to production with ease 😎.

Explosion of Options

There's a myriad of models, data and code to choose from, which creates analysis paralysis for engineering orgs, reducing their "time to value". Skunkworks projects and hackathons can give developers the freedom to evaluate each of these for their use case.

💡

We have engineers dedicated to helping teams evaluate combinations of models, data and code for their use case, completely free. Schedule a Call

Data

Vector databases, transactional databases, and text indexes often get paired with 3rd party APIs and all have their own re-ranking solutions. Different embedding models, similarity algorithms (cosine, euclidian, dotproduct), metadata filtering, and top_k all yield their own results

We maintain an open source vector database comparison/evaluation repo: http://vectorsearch.dev/

Name	Open Source	Index Type
Pinecone	No	Vector
MongoDB	Yes	Hybrid (Dense, Sparse, NoSQL)
Weaviate	Yes	Hybrid (Dense and Sparse)
Google	No	Vector
Elastic	Yes	Hybrid (Dense and Sparse)
Algolia	No	Text
Vespa	Yes	Hybrid
Milvus	Yes	Vector
Redis	Yes	Transactional (Dense, Sparse and Key Value)
Qdrant	Yes	Vector
OpenSearch	Yes	Hybrid (Dense and Sparse)
LucidWorks	No	Text

Models

Huggingface now has over 1 million models in their inference catalogue. There are over 20 managed LLMs that all have their own trade-offs. Different hyperparameter options (temperature, top_p, etc.) and baseline models (7b vs 30b parameters) all yield completely different results.

Name	Open Source	Type
GPT-4	No	Sentence Completion, Text Generation
LaMDA 2	No	Conversational AI
Claude	No	Sentence Completion, Text Generation
Bard	No	Text Analysis, Generation
Cohere	No	Natural Language Understanding, Generation
LLaMA2	Yes	Text and Code Generation

Code

Pre-processing libraries can be overwhelming. Langchain for example has over 50 document loaders each with their own chunking parameters that vastly change the output quality of models.

Name	Open Source	Type
LlamaIndex	Yes	Data Indexing
Langchain	Yes	Text Processing
Unstructured	Yes	Data Extraction
Haystack	Yes	Information Retrieval

On the path to production, engineers must combine each of these primitives (ex: Pinecone, GPT-4 and Langchain) chain them together, and experiment with the parameters of each to see how they perform against acceptance criteria.

Rapid Prototyping

Objectives: This involves conducting market research, talking to users, defining scope, and more.
Consolidation: All technology options need to vetted against security and compliance concerns then be quickly accessible to developers. Ensuring solutions are "air-gapped" is often key. Middleware tooling can accelerate this.
Learning Curves: Some tools such as Langchain are notoriously more challenging to work with than LlamaIndex, but expose more functionality. Workshops and documentation can accelerate this.
Acceptance Criteria: It's helpful to define what the output of a happy-path pipeline is. This allows frameworks such as optimize to automate bruteforce hyperparameter optimization.
Cost Optimizations: Develop a budget that accounts for tokens, compute and operational management. Sizing hardware and projecting spend can involve asking questions and defining assumptions.

Path to Production

Independent Scaling: Scaling each block within the chain independently can reduce overall capex and increase performance. Every stage is deployed as a container with defined inputs, outputs and resources. Why should a tiny spellcheck model live on the same server as your beefy LLM? This also grants load balancing to distribute traffic and compute evenly.
Inclusive Network: By having all the systems coupled within the same network, not only does it reduce privacy concerns it also drastically reduces latency. Do you trust these models with your code? How do you know it won't be used for other purposes?
Hosting: GPUs, CPUs, RAM/Disk allocation and more are all the options to consider. Then figuring out hosting LLMs for inference, etc. and how this impacts cost structure.
Monitoring: Track system performance, model accuracy, and user engagement.
Error Handling: Alerting mechanisms to notify the team of errors or performance degradation.
Integration: Creating a seamless API layer for integration and endpoints to keep the system up to date (webhooks/triggers and streaming pipelines)

Inference

If you want to call your LLMs and ML models, you need to host them for inference. Here are some options:

Name	Open Source	Type
SageMaker	No	ML Model Building and Deployment
Nvidia Triton	Yes	AI Model Serving Framework
Ray	Yes	Distributed Computing for AI
HF Inference	No	Model Hosting and Inference
ClearML	Yes	ML Ops and Versioning
BentoML	Yes	Model Serving and Deployment
RunPod	No	GPU Cloud for AI and ML
BaseTen	No	ML Model Deployment
Modal	No	ML Model Development and Deployment
Banana	No	ML Model Monitoring
Cerebrium	No	AI-based Analysis and Prediction
Lepton	Yes	Distributed Deep Learning
Fireworks	No	ML Model Scaling and Management
Octoml	No	ML Model Optimization and Deployment

Evolving

Feedback Loops: Continuously monitor and adjust the model parameters automate acceptance criteria testing.
Reinforcement Learning: Collect user feedback and use it to inform model retraining and updates. Then schedule batch updates.

Is this exciting? Overwhelming? NUX standards for New User Experience, and we're building for this transition of Playground to Production with two offerings:

Playground: Consolidated, Jupyter-like familiar workbook interface to combine organization-vetted models, data connections, and coding packages.
Production: Export your workbook into Infrastructure as Code, which translates into a container orchestration system that can be deployed anywhere. On-premise or in your cloud's VPC.

Explore Use Cases

💡

Oh and we're also hiring DevOps Engineers

Taking AI Apps from Prototype to Production

Explosion of Options

Data

Models

Code

Rapid Prototyping

Path to Production

Inference

Evolving

Integrate MongoDB Vector Search into Your RAG AI App

RAG Microservices: Getting to "AI Everywhere"

What will you build?

Explore templates or build your own.

Resources

Industries

Taking AI Apps from Prototype to Production

Explosion of Options

Data

Models

Code

Rapid Prototyping

Path to Production

Inference

Evolving

Integrate MongoDB Vector Search into Your RAG AI App

RAG Microservices: Getting to "AI Everywhere"

What will you build?

Explore templates or build your own.