During the AWS re:Invent conference, AWS Vice President of Databases, Analytics, and Machine Learning Swami Sivasubramanian announced Model Evaluation on Bedrock, now available in preview, for models in the Amazon Bedrock repository.
Model Evaluation has two components: automated evaluation and human evaluation. For the automated version, developers can log in to their Bedrock console and choose a model to test. They can then evaluate the model’s performance on metrics such as robustness, accuracy, or toxicity for tasks such as summarization, text classification, Q&A, and text generation. Bedrock includes popular third-party AI models such as Meta’s Llama 2, Anthropic’s Claude 2, and Stability AI’s Stable Diffusion. While AWS provides test datasets, customers can bring their own data into the benchmarking platform, so the results reflect how the models perform on their own use cases. The system then generates a report.
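For developers who prefer the API to the console, an automated evaluation job can also be started programmatically. The sketch below uses the AWS SDK for Python (boto3) and its `create_evaluation_job` call on the Bedrock control-plane client; the job name, IAM role ARN, S3 bucket, and dataset are hypothetical placeholders, and since the feature is in preview, field names and accepted values may differ from what is shown here.

```python
import boto3

# Bedrock control-plane client (not bedrock-runtime), assumed region.
bedrock = boto3.client("bedrock", region_name="us-east-1")

# A minimal sketch: evaluate a single model on a summarization task,
# scoring accuracy, robustness, and toxicity against a custom dataset.
response = bedrock.create_evaluation_job(
    jobName="summarization-eval-demo",  # hypothetical job name
    roleArn="arn:aws:iam::123456789012:role/BedrockEvalRole",  # placeholder IAM role
    inferenceConfig={
        "models": [
            {"bedrockModel": {"modelIdentifier": "anthropic.claude-v2"}}
        ]
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "Summarization",
                    # Bring-your-own data; AWS also provides built-in test datasets.
                    "dataset": {
                        "name": "MyCustomDataset",  # placeholder dataset name
                        "datasetLocation": {
                            "s3Uri": "s3://my-eval-bucket/dataset.jsonl"  # placeholder bucket
                        },
                    },
                    "metricNames": [
                        "Builtin.Accuracy",
                        "Builtin.Robustness",
                        "Builtin.Toxicity",
                    ],
                }
            ]
        }
    },
    # Where the generated evaluation report is written.
    outputDataConfig={"s3Uri": "s3://my-eval-bucket/results/"},
)

print("Started evaluation job:", response["jobArn"])
```

The report the article mentions is then delivered to the configured S3 output location once the job completes.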
For human evaluation, customers can choose to work with an AWS-managed assessment team or their own. They must specify the type of task (summarization or text generation, for example), the evaluation metrics, and the dataset they want to use. AWS will provide customized pricing and timelines for those who work with its evaluation team.
AWS Vice President of Generative AI Vasi Philomin told The Verge in an interview that a better understanding of how models behave drives development. It also lets companies check, before building on top of a model, whether it falls short of certain responsible AI standards, for example if its toxicity sensitivity is too low or too high.
“It’s important that the models work for our customers, to know which model suits them best, and we’re giving them a way to better evaluate it,” Philomin said.
Sivasubramanian also said that human evaluators can assess qualities that the automated system can’t, such as empathy or friendliness.
AWS won’t require all customers to benchmark models, Philomin said, as some developers may have worked with some of the core models on Bedrock previously. Companies that are still exploring which models to use could benefit from following the benchmarking process.
AWS said that while the benchmarking service is in preview, it will charge only for the model inference used during evaluation. There is no universal standard for benchmarking AI models, though certain metrics are broadly accepted within specific industries. Philomin said the goal of benchmarking on Bedrock is not to evaluate models in the abstract, but to give companies a way to measure how a model affects their own projects.