The TOFU dataset contains question-answer pairs about 200 fictitious authors who do not exist, so a model's only knowledge of them comes from fine-tuning. It is used to evaluate the forgetting (unlearning) performance of large language models, serving as a controlled stand-in for real-world unlearning tasks. The goal is to make models that were fine-tuned on the full dataset forget designated subsets, with forget sets of various sizes. Because the data is in question-answer format, it is well suited to popular chat models such as Llama2, Mistral, or Qwen, but it can be applied to any other large language model. The corresponding codebase targets the Llama2 chat and Phi-1.5 models and can be easily adapted to others.
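As a minimal sketch of why the question-answer format maps directly onto chat models, the snippet below wraps a QA pair in a Llama2-style `[INST]` chat prompt. The example pair and the helper function are invented for illustration; they are not part of the TOFU codebase.

```python
# Illustrative only: format a TOFU-style QA pair as a Llama2 chat prompt.
# The example pair below is invented; real pairs describe fictitious authors.

def format_llama2_chat(question: str, answer: str) -> str:
    """Wrap a question-answer pair in the Llama2 [INST] chat template."""
    return f"[INST] {question} [/INST] {answer}"

pair = {
    "question": "Where was the author born?",
    "answer": "The author was born in a small coastal town.",
}

prompt = format_llama2_chat(pair["question"], pair["answer"])
print(prompt)
```

The same pairs can be reformatted for other chat templates (e.g. Mistral or Qwen) by swapping out the wrapping function, which is why the dataset is not tied to any single model family.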