## Operationalizing Generative AI on Vertex AI using MLOps

> [!Abstract]-
> The emergence of foundation models and generative AI (gen AI) has introduced a new era for building AI systems. Selecting the right model from a diverse range of architectures and sizes, curating data, engineering optimal prompts, tuning models for specific tasks, grounding model outputs in real-world data, optimizing hardware - these are just a few of the novel challenges that large models introduce. This whitepaper delves into the fundamental tenets of MLOps and the necessary adaptations required for the domain of gen AI and Foundation Models. We also examine the diverse range of Vertex AI products, specifically tailored to address the unique demands of foundation models and gen AI-based applications. Through this exploration we uncover how Vertex AI, with its solid foundations of AI infrastructure and MLOps tools, expands its capabilities to provide a comprehensive MLOps platform for gen AI.

> [!Cite]-
> Nawalgaria, Anant, Gabriela Hernandez Larios, Elia Secchi, Mike Styer, Christos Aniftos, and Onofrio Petragallo. “Operationalizing Generative AI on Vertex AI Using MLOps,” September 1, 2024. [https://www.kaggle.com/whitepaper-operationalizing-generative-ai-on-vertex-ai-using-mlops](https://www.kaggle.com/whitepaper-operationalizing-generative-ai-on-vertex-ai-using-mlops).
>
> [link](https://www.kaggle.com/whitepaper-operationalizing-generative-ai-on-vertex-ai-using-mlops) [online](http://zotero.org/users/local/kycSZ2wR/items/NVQYHFLW) [local](zotero://select/library/items/NVQYHFLW) [pdf](file://C:\Users\erikt\Zotero\storage\RRYN28AU\Nawalgaria%20et%20al.%20-%20Operationalizing%20Generative%20AI%20on%20Vertex%20AI%20using%20.pdf)

## Notes

%% begin notes %%
%% end notes %%

%% begin annotations %%

### Imported: 2024-11-16 11:59 am

The emergence of foundation models and generative AI (gen AI) has introduced a new era for building AI systems. Selecting the right model from a diverse range of architectures and sizes, curating data, engineering optimal prompts, tuning models for specific tasks, grounding model outputs in real-world data, optimizing hardware - these are just a few of the novel challenges that large models introduce. Through this exploration we uncover how Vertex AI, with its solid foundations of AI infrastructure and MLOps tools, expands its capabilities to provide a comprehensive MLOps platform for gen AI.

DevOps is a software engineering methodology that aims to bridge the gap between development (Dev) and operations (Ops). It promotes collaboration, automation, and continuous improvement to streamline the software development lifecycle, introducing practices such as continuous integration and continuous delivery.

MLOps builds upon DevOps principles to address the unique challenges of operationalizing Machine Learning systems rapidly and reliably. In particular, MLOps tackles the experimental nature of ML through practices like:

- Data validation: Ensuring the quality and integrity of training data.
- Model evaluation: Rigorously assessing model performance with appropriate metrics.
- Model monitoring: Tracking model behavior in production to detect and mitigate drift.
- Tracking & reproducibility: Maintaining meticulous records for experiment tracking and result reproduction.

Here are some factors to consider when exploring models:

1. Quality: Early assessments can involve running test prompts or analyzing public benchmarks and metrics to gauge output quality.
2. Latency & throughput: These factors directly impact user experience. A chatbot demands lower latency than batch-processed summarization tasks.
3. Development & maintenance time: Consider the time investment for both initial development and ongoing maintenance. Managed models often require less effort than self-deployed open-source alternatives.
4. Usage cost: Factor in infrastructure and consumption costs associated with using the chosen model.
5. Compliance: Assess the model's ability to adhere to relevant regulations and licensing terms.

Foundation models differ from predictive models most importantly because they are multipurpose models. Instead of being trained for a single purpose, on data specific to that task, foundation models are trained on broad datasets, and therefore can be applied to many different use cases.

While models in the predictive AI context are self-sufficient and task-specific, gen AI models are multipurpose and need an additional element beyond the user input to function as part of a gen AI application: a prompt, and more specifically, a prompt template, defined as a set of instructions and examples along with placeholders to accommodate user input. A prompt template, along with dynamic data such as user input, can be combined to create a complete prompt, the text that is passed as input to the foundation model.

When applying MLOps practices to gen AI, it becomes important to have in place processes that give developers easy storage, retrieval, tracking, and modification of prompts. In tracking the results of an experiment, both the prompt (and its component versions) and the model version must be recorded and stored along with the metrics and output data produced by the prompted model.

Depending on the use case, leveraging only one prompted model to perform a particular generation might not be sufficient. To solve this issue, leveraging a divide-and-conquer approach, several prompted models can be connected together, along with calls to external APIs and logic expressed as code. A sequence of prompted model components connected together in this way is commonly known as a chain.

RAG and Agents approaches can be combined to create multi-agent systems connected to large information networks, enabling sophisticated query handling and real-time decision-making.

There are several products in Vertex AI that can support the need for chaining and augmentation, including Grounding as a service,5 Extensions,6 Vector Search,7 and Agent Builder.8 We discuss these products in the section “Role of an AI Platform”. LangChain9 is also integrated with the Vertex SDK,10 and can be used alongside the core Vertex products to define and configure gen AI chained applications.

When developing a gen AI use case and a specific task that involves LLMs, it can be difficult, especially for complex tasks, to rely only on prompt engineering and chaining to solve it. To improve task performance, practitioners often also need to fine-tune the model directly. Fine-tuning lets you actively change the layers or a subset of layers of the LLM to optimize the capability of the model to perform a certain task.

Supervised fine-tuning: This is where we train the model in a supervised manner, teaching it to predict the right output sequence for a given input.

Reinforcement Learning from Human Feedback (RLHF): In this approach, we first train a reward model to predict what humans would prefer as a response. Then, we use this reward model to nudge the LLM in the right direction during the tuning process. It is like having a panel of human judges guiding the model's learning.
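To make the supervised option concrete, here is a minimal sketch of preparing prompt/response pairs as a JSONL tuning dataset in plain Python. The `input_text`/`output_text` field names, the example texts, and the file path are illustrative assumptions; the exact schema expected by a given tuning service or model family will differ.

```python
import json

# Hypothetical labeled examples: each pair teaches the model the desired
# output sequence for a given input, which is the essence of supervised fine-tuning.
examples = [
    {
        "input_text": "Summarize this support ticket: My invoice total is wrong.",
        "output_text": "Customer reports an incorrect invoice total; route to billing.",
    },
    {
        "input_text": "Summarize this support ticket: The app crashes on login.",
        "output_text": "Customer experiences a crash at login; route to engineering.",
    },
]

# Write one JSON object per line (JSONL), a format commonly used for tuning datasets.
with open("tuning_data.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```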
Platforms like Vertex AI11 (and the Google Cloud platform more broadly) provide a robust suite of services designed to address these MLOps requirements: Vertex Model Registry,12 for instance, provides a centralized storage location for all the artifacts created during the tuning job, and Vertex Pipelines13 streamlines the development and management of these tuning jobs. Dataplex,14 meanwhile, provides an organization-wide data fabric for data lineage and governance and integrates well with both Vertex AI and BigQuery.15 What's more, these products provide the same governance capability for both predictive and gen AI applications, meaning customers do not need separate products or configurations to manage generative versus predictive AI development.

In machine learning operations (MLOps), continuous training is the practice of repeatedly retraining machine learning models in a production environment. This is done to ensure that the model remains up-to-date and performs well as real-world data patterns change over time. For gen AI models, continuous tuning of the models is often more practical than retraining from scratch due to the high data and computational costs involved.

Graphics processing units (GPUs) and tensor processing units (TPUs) are key hardware for fine-tuning. GPUs, known for their parallel processing power, are highly effective in handling the computationally intensive workloads often associated with training and running complex machine learning models. TPUs, on the other hand, are specifically designed by Google for accelerating machine learning tasks and excel at the large matrix operations common in deep learning neural networks.

Synthetic data generation: This process involves creating artificial data that closely resembles real-world data in terms of its characteristics and statistical properties, often using a large and capable model. This synthetic data serves as additional training data for gen AI, enabling it to learn patterns and relationships even when labeled real-world data is scarce.

Synthetic data correction: This technique focuses on identifying and correcting errors and inconsistencies within existing labeled datasets. By leveraging the power of larger models, gen AI can flag potential labeling mistakes and propose corrections, improving the quality and reliability of the training data.

Synthetic data augmentation: This approach goes beyond simply generating new data. It involves intelligently manipulating existing data to create diverse variations while preserving essential features and relationships. Thus, gen AI can encounter a broader range of scenarios during training, leading to improved generalization and the ability to generate nuanced and relevant outputs.

Evaluating gen AI, unlike predictive AI, is tricky. You don't usually know the training data distribution of the foundational models. Building a custom evaluation dataset reflecting your use case is essential. This dataset should cover essential, average, and edge cases. Similar to fine-tuning data, you can leverage powerful language models to generate, curate, and augment data for building robust evaluation datasets.

There are some established metrics, like BLEU for translations and ROUGE for summaries, but they don't always tell the full story. That's where custom evaluation methods come in.
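As a small example of one of these established metrics, the snippet below scores a generated summary against a reference using ROUGE via the third-party `rouge-score` package (an assumption, not part of any Vertex AI SDK); the texts are placeholders.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

reference = "The quarterly report shows revenue grew 12% while costs stayed flat."
candidate = "Revenue grew 12% this quarter and costs remained flat."

# ROUGE-1 measures unigram overlap; ROUGE-L measures the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, score in scores.items():
    print(f"{name}: precision={score.precision:.2f} recall={score.recall:.2f} f1={score.fmeasure:.2f}")
```

Scores like these only become meaningful when computed over a curated evaluation dataset of the kind described above.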
Another challenge is the subjective nature of many evaluation metrics for gen AI. What makes one output ‘better’ than another can often be a matter of opinion. The key here is to make sure your automated evaluation aligns with human judgment. You want your metrics to be a reliable proxy for what people would think. And to ensure comparability between experiments, it's crucial to lock down your evaluation approach and metrics early in the development process.

Lack of ground truth data is another common hurdle, especially in the early stages of a project. One workaround is to generate synthetic data to serve as a temporary ground truth, which can be refined over time with human feedback.

Finally, comprehensive evaluation is essential for safeguarding gen AI applications against adversarial attacks. Malicious actors can craft prompts to try to extract sensitive information or manipulate the model's outputs. Evaluation sets need to specifically address these attack vectors, through techniques like prompt fuzzing (feeding the model random variations on prompts) and testing for information leakage.

However, applying continuous integration (CI) to gen AI comes with challenges:

1. Difficult to generate comprehensive test cases: The complex and open-ended nature of gen AI outputs makes it hard to define and create an exhaustive set of test cases that cover all possibilities.
2. Reproducibility issues: Achieving deterministic, reproducible results is tricky since generative models often have intrinsic randomness and variability in their outputs, even for identical inputs. This makes it harder to consistently test for specific expected behaviors.

Quantization reduces the size and computational requirements of the model by converting its weights and activations from higher-precision floating-point numbers to lower-precision representations, such as 8-bit integers or 16-bit floating-point numbers. This can significantly reduce the memory footprint and computational overhead of the model.

Model pruning is a technique for eliminating unnecessary weight parameters or selecting only important subnetworks within the model. This reduces model size while keeping accuracy as high as possible.

Finally, distillation trains a smaller model, using the responses generated by a larger LLM, to reproduce the output of the larger LLM for a specific domain. This can significantly reduce the amount of training data, compute, and storage resources needed for the application.

Imagine the output to a given input is factually inaccurate. How can you find out which of the components did not perform well? To answer this question, it is necessary to apply logging at the application level and at the component level.
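As a minimal sketch of what that can look like, the snippet below uses only Python's standard `logging` module to log at both the application level and the component level of a simple two-step chain; the component names and the placeholder `retrieve` and `generate` functions are hypothetical stand-ins for real chain steps.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(name)s | %(message)s")
app_log = logging.getLogger("genai_app")


def retrieve(query: str) -> str:
    log = logging.getLogger("genai_app.retriever")  # component-level logger
    log.info("query=%r", query)
    context = "Placeholder context retrieved for the query."
    log.info("retrieved_context=%r", context)
    return context


def generate(query: str, context: str) -> str:
    log = logging.getLogger("genai_app.generator")  # component-level logger
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    log.info("prompt=%r", prompt)
    answer = "Placeholder model answer."  # stand-in for a foundation model call
    log.info("answer=%r", answer)
    return answer


def handle_request(query: str) -> str:
    app_log.info("request received: %r", query)  # application-level log
    answer = generate(query, retrieve(query))
    app_log.info("response returned: %r", answer)
    return answer


handle_request("What is our refund policy?")
```

With each component's inputs and outputs recorded, a factually inaccurate answer can be traced back to the step whose output first went wrong.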
Given that the input to the application is typically text, there are a few approaches to measuring skew and drift. In general, all the methods try to identify significant changes in production data, both textual (size of input) and conceptual (topics in input), when compared to the evaluation dataset. All these methods look for changes that could indicate the application might not be prepared to successfully handle the nature of the new data now coming in.

Some common approaches are calculating embeddings and distances, counting text length and number of tokens, and tracking vocabulary changes, new concepts and intents, prompts and topics in datasets, as well as statistical approaches such as least-squares density difference,22 maximum mean discrepancy (MMD),23 learned kernel MMD,24 or context-aware MMD.25 As gen AI use cases are so diverse, it is often necessary to create additional custom metrics that better capture abnormal changes in your data.

Continuous evaluation is another common approach to gen AI application monitoring. In a continuous evaluation system, you capture the model's production output and run an evaluation task using that output, to keep track of the model's performance over time. One approach is collecting direct user feedback, such as ratings (for example, thumbs up/down), which provides immediate insight into the perceived quality of outputs. In parallel, comparing model-generated responses against established ground truth, often collected through human assessment or as a result of an ensemble AI model approach, allows for deeper analysis of performance. Ground truth metrics can be used to generate evaluation metrics as described in the Evaluation section. This process provides a view of how your evaluation metrics have changed from when you developed your model to what you have in production today.

In the context of MLOps, governance encompasses all the practices and policies that establish control, accountability, and transparency over the development, deployment, and ongoing management of machine learning (ML) models, including all the activities related to the code, data, and model lifecycles.

Alongside the explosion of both predictive and gen AI applications, AI platforms, like Vertex AI,11 have emerged as indispensable tools for organizations seeking to leverage the power of Artificial Intelligence (AI). These comprehensive platforms provide a unified environment that streamlines the entire AI lifecycle, from data preparation and model training to deployment, automation, continuous integration/continuous delivery (CI/CD), governance, and monitoring.

As discussed before, there is already a wide variety of available foundation models, trained on a broad range of datasets, and the cost of training a new foundation model can be prohibitive. Thus it often makes sense for companies to adapt existing foundation models rather than creating their own from scratch. As a result, a platform facilitating seamless discovery and integration of diverse model types is critical.

Vertex AI Model Garden1 supports these needs, offering a curated collection of over 150 Machine Learning and gen AI models from Google, Google partners, and the open-source community. It simplifies the discovery, customization, and deployment of both Google's proprietary foundational models and diverse open-source models across a vast spectrum of modalities, tasks, and features. Model Garden fosters experimentation by facilitating access to Google's proprietary foundational models through the Vertex AI Studio UI,37 a playground where you can experiment with prompts and models, and with open-source models through provided Colab notebooks.

Vertex AI Studio60 provides a unified console-driven entry point to access and leverage the full spectrum of Vertex AI's gen AI services. It facilitates exploration and experimentation with various Google first-party foundation models (for example, PaLM 2, Gemini, Codey, Imagen, and Universal Speech Model). Additionally, it offers prompt examples and functionalities for testing distinct prompts and models with diverse parameters. It is also possible to adapt existing models through various techniques like supervised fine-tuning (SFT), reinforcement learning tuning techniques, and distillation, and to deploy gen AI applications in just a few clicks.
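The programmatic counterpart to this kind of console experimentation might look like the sketch below, which assumes the `google-cloud-aiplatform` Python SDK; the project ID, region, model ID, prompt, and parameter values are placeholders to adapt, and module paths can vary across SDK versions.

```python
# pip install google-cloud-aiplatform
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")  # assumed project and region

model = GenerativeModel("gemini-1.5-flash")  # assumed model ID; check which models are currently available
prompt = "Rewrite this sentence for a non-technical audience: MLOps operationalizes ML systems."

# Run the same prompt with different sampling parameters to compare behavior.
for temperature in (0.0, 0.7):
    response = model.generate_content(
        prompt,
        generation_config=GenerationConfig(temperature=temperature, max_output_tokens=128),
    )
    print(f"temperature={temperature}:\n{response.text}\n")
```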
Any training or tuning job you run can be orchestrated and then operationalized using Vertex Pipelines,13 a service that aims to simplify and automate the deployment, management, and scaling of your ML workflows.

Vertex AI function calling69 empowers users by enhancing the capabilities of large language models (LLMs). It enables LLMs to access real-time data and interact with external systems, providing users with more accurate and up-to-date information. To do that, users provide function definitions (such as the description, inputs, and outputs) to the gen AI model. Instead of directly executing functions, the LLM intelligently analyzes user requests and generates structured data outputs. These outputs propose which function to call and what arguments to use.

Vertex AI Grounding5 helps users connect large models with verifiable information by grounding them to internal data corpora on Vertex AI Agent Builder70 or to external sources using Google Search. This enables two key functionalities: verifying model-generated outputs against internal or external sources, and creating RAG systems using Google's advanced search capabilities that produce quality content grounded in your own data or web search data.

Vertex AI extensions6 let developers integrate Vertex Foundation Models with real-time data and real-world actions through APIs and functions, enabling task execution and allowing enhanced capabilities. This extends to leveraging 1st-party extensions like Vertex AI Search7 and Code Interpreter,71 or 3rd-party extensions for triggering and completing transactions.

Vertex AI Agent Builder70 is an out-of-the-box solution that allows you to quickly build gen AI agents, to be used as conversational chatbots or as part of a search engine. With Vertex AI Agent Builder, you are able to easily ground your agents by pointing to a diverse range of data sources, including structured datastores such as BigQuery, Spanner, and Cloud SQL; unstructured sources like crawled website content and Cloud Storage; as well as connectors to Google Drive and other APIs.

Vertex AI Vector Search7 is a fully managed, highly scalable, low-latency vector database and similarity search service that auto-scales to billions of vector embeddings. This technology, built upon ScaNN72 (a Google-developed technology used in products like Search, YouTube, and Play), allows you to search across billions of items in your stored data for semantically similar or related results.

Vertex AI Feature Store74 is a centralized and fully managed repository for ML features and embeddings. It enables teams to share, serve, and reuse machine learning features and embeddings effortlessly alongside other data.

Vertex AI offers the flexibility to seamlessly create and connect various products to build your own custom grounding, RAG, and agent systems. This includes utilizing diverse embedding models (multimodal, multilingual), various vector stores (Vector Search, Feature Store), and search engines like Vertex AI Agent Builder, as well as extensions, grounding, and even SQL query generation for complex natural language queries.
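As a hedged sketch of composing such pieces yourself, the snippet below embeds a few documents with the Vertex AI embeddings API, performs a brute-force cosine-similarity lookup in memory (a stand-in for a real vector store such as Vector Search), and assembles a grounded prompt. The project ID, model ID, and document texts are assumptions, and module paths can vary across SDK versions.

```python
# pip install google-cloud-aiplatform numpy
import numpy as np
import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="your-project-id", location="us-central1")  # assumed project and region

documents = [
    "Refunds are issued within 14 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm.",
    "Enterprise plans include a dedicated account manager.",
]

embedding_model = TextEmbeddingModel.from_pretrained("text-embedding-004")  # assumed model ID
doc_vectors = np.array([e.values for e in embedding_model.get_embeddings(documents)])


def retrieve(query: str, top_k: int = 1) -> list[str]:
    """Brute-force cosine similarity; a production system would query a vector store instead."""
    q = np.array(embedding_model.get_embeddings([query])[0].values)
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [documents[i] for i in np.argsort(scores)[::-1][:top_k]]


query = "How long do refunds take?"
context = "\n".join(retrieve(query))
grounded_prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
print(grounded_prompt)  # this grounded prompt would then be sent to a foundation model
```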
Moreover, Vertex AI provides SDK integration with LangChain9 to easily build and prototype applications under the umbrella of Vertex AI products.

Vertex AI seamlessly integrates experimentation and collaboration into the development lifecycle of AI/ML and gen AI models and applications. Its Workbench Instances77 provide Jupyter-based development environments for the entire data science workflow, connected to other Google Cloud services and with GitHub synchronization capabilities. Vertex Colab Enterprise78 accelerates the AI workflow by enabling collaborative coding and leveraging code completion and generation features.

Vertex AI also provides two tools for tracking and visualizing the output of many experiment cycles and training runs. Vertex AI Experiments79 facilitates meticulous tracking and analysis of model architectures, hyperparameters, and training environments. It logs experiments, artifacts, and metrics, enabling comparison and reproducibility across multiple runs. This comprehensive tracking permits data scientists to select the optimal model and architecture for their specific use case.

Vertex AI TensorBoard80 complements the experimentation process by providing detailed visualizations for tracking, visualizing, and sharing ML experiments. It offers a range of visualizations, including loss and accuracy metrics tracking, model computational graph visualization, and weight and bias histograms, which - for example - can be used for tracking various metrics pertaining to training and evaluation of gen AI models with different prompting and tuning strategies. It also projects embeddings to a lower-dimensional space and displays image, text, and audio samples.

For ground-truth-based metrics, Automatic Metrics in Vertex AI81 lets you evaluate a model based on a defined task and a “ground truth” dataset. For LLM-based evaluation, Automatic Side by Side (Auto SxS) in Vertex AI82 uses a large model to evaluate the output of multiple models or configurations being tested, helping to augment human evaluation at scale.

Once developed, a production gen AI application must be deployed, including all its model components. If the application uses any models that have been trained or adapted, those models need to be deployed to their own serving endpoints. You can serve any model in the Model Garden through Vertex AI Endpoints,21 which act as the gateway for deploying your trained machine learning models.

Vertex AI Model Registry12 serves as a centralized repository for comprehensive lifecycle management of both Google proprietary foundational and open-source Machine Learning models. Vertex AI Model Registry bolsters observability by providing integrated configuration and access to Vertex AI Model Monitoring91 and logging functionalities. This enables proactive identification and mitigation of both training-serving skew and prediction drift, ensuring the reliability and accuracy of deployed models.

Google Cloud Dataplex14 provides organization-wide lineage across product boundaries in Google Cloud. Within the domains of AI and gen AI (and more broadly across data analytics and AI/ML), Dataplex seamlessly integrates with BigQuery and Vertex AI. Dataplex facilitates the unification, management, discovery, and governance of both data and models.

%% end annotations %%

%% Import Date: 2024-11-16T11:59:32.533-07:00 %%