The Anthropic API has recently introduced a prompt caching feature that allows developers to cache frequently used context between API calls. With prompt caching, customers can supply the Claude model with more background knowledge and example outputs while reducing the cost of long prompts by up to 90% and cutting latency by up to 85%.

This feature is currently available in public beta for Claude 3.5 Sonnet and Claude 3 Haiku, with support for Claude 3 Opus planned.
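As a rough illustration, the sketch below shows how a cacheable system prompt might be marked using the Anthropic Python SDK. The model string, the beta request header, and the `cache_control` field reflect the feature's beta documentation at the time and should be treated as assumptions rather than a definitive integration.

```python
# Minimal sketch of prompt caching with the Anthropic Python SDK (assumptions:
# the `anthropic` package is installed, ANTHROPIC_API_KEY is set, and the beta
# header / field names below match the prompt-caching beta docs).
import anthropic

client = anthropic.Anthropic()

LONG_BACKGROUND = "..."  # e.g. a large document or a detailed instruction set

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    # Beta opt-in header used during the prompt caching public beta.
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=[
        {
            "type": "text",
            "text": LONG_BACKGROUND,
            # Marks this block as cacheable: later requests that reuse the
            # identical prefix read it from the cache instead of reprocessing it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the key points above."}],
)

print(response.content[0].text)
```

Subsequent requests that repeat the same cached prefix are the ones that benefit; the cache write happens on the first call, and reuse is billed at the reduced cached-token rate described below.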

Prompt caching is particularly useful when large amounts of prompt context must be referenced repeatedly across multiple requests. Conversational agents can reduce the cost and latency of long conversations, especially those involving complex instructions or uploaded documents. Coding assistants can improve autocompletion and codebase Q&A by keeping a summarized version of the codebase in the prompt. When processing large documents, prompt caching makes it possible to embed complete long materials without increasing response time. And in scenarios involving multi-round tool calls and iterative changes, such as agentic search and tool use, prompt caching significantly improves performance.

Prompt caching pricing depends on how many input tokens are cached and how often they are reused. Writing to the cache costs 25% more than the base input token price, while reading cached content is much cheaper, at just 10% of the base input token price.
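For a rough sense of the economics, the back-of-envelope calculation below applies those percentages to a hypothetical base input rate of $3 per million tokens; the prefix size and request count are illustrative figures, not Anthropic pricing data.

```python
# Back-of-envelope cost comparison for a long prompt prefix reused across many
# requests. Prices are illustrative: a hypothetical base input rate of $3 per
# million tokens, with cache writes at 125% of base and cache reads at 10%.
BASE = 3.00 / 1_000_000        # $ per input token (assumed base rate)
CACHE_WRITE = BASE * 1.25      # first request writes the prefix to the cache
CACHE_READ = BASE * 0.10       # later requests read it back from the cache

prefix_tokens = 100_000        # shared context: documents, examples, instructions
requests = 50                  # number of calls that reuse the same prefix

without_cache = prefix_tokens * BASE * requests
with_cache = prefix_tokens * CACHE_WRITE + prefix_tokens * CACHE_READ * (requests - 1)

print(f"without caching: ${without_cache:.2f}")   # $15.00
print(f"with caching:    ${with_cache:.2f}")      # $1.85
print(f"savings:         {1 - with_cache / without_cache:.0%}")  # ~88%
```

Under these assumptions the savings approach the up-to-90% figure cited above; the more often a cached prefix is reused, the closer the effective cost gets to the 10% read rate.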

Notion, an Anthropic API customer, has reportedly integrated prompt caching into its AI assistant, Notion AI. The resulting cost reductions and speed gains have helped Notion optimize its internal operations and deliver a faster, more capable experience to users.