Sari-sari store in the Philippines

Sarisari-Bench

An AI agent benchmark for managing a sari-sari store in the Philippines. We measure the ability of models to stay coherent and successfully manage a simulated business over 30 days.

Long-term coherence in agents is more important than ever. We expect AI models to soon take an active part in the economy, managing entire businesses. To do that, they have to stay coherent and efficient over very long time horizons. Sarisari-Bench tests this by having each model run a simulated sari-sari store from start to finish.

Return on Investment

Final cash as a percentage of the initial ₱10,000

Rank  Model              Final Cash (₱)  Return  Profit (₱)
1     Gemini 2.5 Flash   12,871.00       128.7%  +2,871.00
2     Claude Sonnet 4    11,911.00       119.1%  +1,911.00
3     GPT-4o             11,377.00       113.8%  +1,377.00
4     GPT-4.1            11,310.00       113.1%  +1,310.00
5     GPT-4o Mini        11,067.00       110.7%  +1,067.00
6     Gemini 2.0 Flash   10,965.00       109.7%  +965.00
7     CodeLlama 7B       10,952.00       109.5%  +952.00
8     GPT-4.1 Mini       10,924.00       109.2%  +924.00
9     Grok 3 Mini        10,664.00       106.6%  +664.00
10    Claude 3.5 Haiku   10,000.00       100.0%  +0.00
Initial cash: ₱10,000.00 (100%)
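
The Return and Profit columns follow directly from the final cash balance and the ₱10,000 starting capital. A minimal sketch of that arithmetic is shown here; the function name summarize is illustrative and is not taken from the benchmark's code.

```python
INITIAL_CASH = 10_000.00  # starting capital in PHP, as stated on the leaderboard


def summarize(final_cash, initial_cash=INITIAL_CASH):
    """Return (return_pct, profit) for one benchmark run.

    return_pct is final cash as a percentage of initial cash;
    profit is the absolute change in cash.
    """
    return_pct = final_cash / initial_cash * 100
    profit = final_cash - initial_cash
    return return_pct, profit


# Example using the top leaderboard row (final cash of 12,871.00).
pct, profit = summarize(12_871.00)
print(f"Return: {pct:.1f}%  Profit: {profit:+,.2f}")
```

A return of 100.0% means the model ended with exactly the cash it started with, as in the last leaderboard row.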

Cash Balance Over Time

Average daily cash balance by model

Frequently Asked Questions

What is Sarisari-Bench?

Sarisari-Bench is an AI agent benchmark that simulates managing a sari-sari store (small neighborhood store) in the Philippines. It evaluates how well AI models can make coherent business decisions over a 30-day period.

How does the benchmark work?

Each AI model starts with ₱10,000 (PHP) and must manage inventory, respond to customer demand, and make purchasing decisions over 30 simulated days. The primary metric is the final cash balance, which reflects both profitability and decision-making quality.
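
To make the loop concrete, here is a toy sketch of a 30-day episode. It is not the benchmark's actual simulator: the single product, its cost and price, the demand range, and the names run_episode and restock_to_30 are all assumptions made for illustration. In the real benchmark an AI model makes the decisions; here a fixed rule stands in for it.

```python
import random


def run_episode(policy, days=30, initial_cash=10_000.0, seed=0):
    """Simulate one store for `days` days and return the final cash balance.

    `policy(day, cash, stock)` returns how many units to order that day.
    All prices and the demand range below are hypothetical.
    """
    rng = random.Random(seed)
    unit_cost, unit_price = 8.0, 10.0  # hypothetical single product
    cash, stock = initial_cash, 0
    for day in range(1, days + 1):
        # The agent chooses an order size; cap it at what the cash can cover.
        wanted = int(policy(day, cash, stock))
        order = max(0, min(wanted, int(cash // unit_cost)))
        cash -= order * unit_cost
        stock += order
        # Demand is random; sales are limited by the stock on hand.
        demand = rng.randint(20, 40)
        sold = min(demand, stock)
        stock -= sold
        cash += sold * unit_price
    return cash


def restock_to_30(day, cash, stock):
    """Stand-in policy: top the shelf back up to 30 units every morning."""
    return max(0, 30 - stock)


final_cash = run_episode(restock_to_30)
print(f"Final cash: {final_cash:.2f}")
```

A policy that never orders anything ends with exactly the starting cash, which is the same outcome as a 100.0% return on the leaderboard.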

Which AI models are supported?

Sarisari-Bench supports major API models (GPT-4o, Claude, Gemini) and local LLMs via Ollama and LM Studio, including Llama, Phi, CodeLlama, and Gemma models.

What is a sari-sari store?

A sari-sari store is a small neighborhood convenience store commonly found in the Philippines. These stores sell everyday items such as snacks, drinks, canned goods, and household essentials in small quantities.

How can I run the benchmark myself?

You can clone the repository from GitHub and use the provided Python scripts (run_benchmark.py) to test models locally with your own API keys or local LLM setup.