codefuse-devops-eval

DevOps benchmark

An industrial-first evaluation suite for assessing foundation models (LLMs) in the DevOps/AIOps domain.

GitHub

- 690 stars · 9 watching · 44 forks
- Language: Python
- Last commit: 8 months ago
- Linked from 1 awesome list


Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| codefuse-ai/codefuse-devops-model | An industrial-first language model for answering questions in the DevOps domain | 596 |
| codefuse-ai/codefuse-chatbot | An AI-powered tool designed to simplify and optimize various stages of the software development lifecycle | 1,202 |
| codefuse-ai/test-agent | A tool that empowers software testing with large language models | 565 |
| codefuse-ai/mftcoder | A framework for fine-tuning large language models on multiple tasks to improve their accuracy and efficiency | 647 |
| bregman-arie/howtheydevops | A collection of publicly available resources on DevOps practices from companies around the world | 733 |
| princeton-nlp/intercode | An interactive code environment framework for evaluating language agents through execution feedback | 198 |
| open-evals/evals | A framework for evaluating OpenAI models and an open-source registry of benchmarks | 19 |
| cloud-cv/evalai | A platform for comparing and evaluating AI and machine learning algorithms at scale | 1,779 |
| hkust-nlp/ceval | An evaluation suite providing multiple-choice questions for foundation models across various disciplines, with tools for assessing model performance | 1,650 |
| johnathan79717/codeforces-parser | Generates sample tests and input/output files for competitive programming contests | 137 |
| alco/benchfella | Tools for comparing and benchmarking small code snippets | 514 |
| microsoft/codexglue | A benchmark dataset and open challenge to improve AI models' ability to understand and generate code | 1,575 |
| openai/simple-evals | Evaluates language models using standardized benchmarks and prompting techniques | 2,059 |
| joelwmale/codeception-action | An action for running Codeception tests in GitHub workflows | 15 |
| openai/procgen | A benchmark for evaluating reinforcement learning agents on procedurally generated game-like environments | 1,030 |