New Datasets

Discover the latest datasets designed to accelerate machine learning innovation—diverse, scalable, and ready to power your next project.

2025.12

With a focus on cutting-edge datasets, we created a Tablet GUI dataset as an addition to the Agent dataset pool, a Sound Events dataset for rich sound recognition, and a Video Reasoning Q&A dataset for multimodal logical reasoning training. We also added 11 Corpus Pairs datasets to the list for multilingual LLM training.

Sound Events Dataset

Unit: 632.8 hours

N/A
N/A
Audio
Special Audio
Chinese Tablet GUI Dataset

Unit: 15675

China
Chinese
Image
Agent
Chinese Video Reasoning Q&A Multimodal Dataset

Unit: 150,000 Videos

N/A
Chinese
Multimodal
Video-text
University Graduate English Question Bank Dataset

Unit: 10M questions

N/A
N/A
LLM
Question Bank
Chinese & English Corpus Pairs Dataset

Unit: 150MB

N/A
Chinese & English
LLM
Corpus Pairs
Chinese & German Corpus Pairs Dataset

Unit: 150MB

N/A
Chinese & German
LLM
Corpus Pairs
Chinese & Turkish Corpus Pairs Dataset

Unit: 150MB

N/A
Chinese & Turkish
LLM
Corpus Pairs
Chinese & Italian Corpus Pairs Dataset

Unit: 150MB

N/A
Chinese & Italian
LLM
Corpus Pairs
Chinese & Indonesian Corpus Pairs Dataset

Unit: 150MB

N/A
Chinese & Indonesian
LLM
Corpus Pairs
Chinese & Hindi Corpus Pairs Dataset

Unit: 150MB

N/A
Chinese & Hindi
LLM
Corpus Pairs
Chinese & Filipino Corpus Pairs Dataset

Unit: 150MB

N/A
Chinese & Filipino
LLM
Corpus Pairs
Chinese & Thai Corpus Pairs Dataset

Unit: 150MB

N/A
Chinese & Thai
LLM
Corpus Pairs
Chinese & Vietnamese Corpus Pairs Dataset

Unit: 150MB

N/A
Chinese & Vietnamese
LLM
Corpus Pairs
Chinese & Malay Corpus Pairs Dataset

Unit: 150MB

N/A
Chinese & Malay
LLM
Corpus Pairs
Chinese & HK Traditional Chinese Corpus Pairs Dataset

Unit: 150MB

N/A
Chinese
LLM
Corpus Pairs

2025.10

Following the latest trends, we created a Robotic Arm dataset, a Medical Paper dataset, and a Cultural Relic Multimodal dataset for Embodied AI and LLM projects.

Chinese Duplex Conversation Dataset

Unit: 3200 hours

China
Chinese
Audio
Dialogue
Chinese Question Bank Image Dataset

Unit: 10000 questions

China
Chinese
LLM
Question Bank
Cultural Relic Image-Text Dataset

Unit: 500k pairs

N/A
N/A
Multimodal
Image-text
English Books Dataset -2

Unit: 2.5M books

N/A
English
LLM
Pre-Training
English Medical Journal Papers Dataset

Unit: 30M articles

N/A
English
LLM
Pre-Training
Robotic Arm Manipulation Video Dataset

Unit: 1000 hours

N/A
N/A
Embodied AI
Embodied AI

2025.09

Following the latest trends, we created new datasets for Embodied AI, Agent, and Photo Editing.

China Gaokao Question Bank Dataset

Unit: 155k questions

China
Chinese
LLM
Question Bank
Chinese Photoshop Editing Multimodal Dataset

Unit: 250k pairs

China
Chinese
Multimodal
Image-text
Digital Human Dataset

Unit: 1000 IDs

China
Chinese
Video
Digital Human
First-Person Perspective Hand Operation Video Dataset

Unit: 500 hours

China
N/A
Embodied AI
Embodied AI
Chinese Smartphone GUI Dataset

Unit: 8000 images

China
English
Image
Agent
SWE Bench Dataset

Unit: 20000 pairs

China
English
LLM
Q&A

Corporate Headquarters

Level 6/9 Help St Chatswood NSW 2067 Australia

61-2-9468-6300

US Headquarters

12131 113th Ave, N.E., Suite 100

Kirkland, WA  98034

Int’l Collect +1 206-800-2101

Fax +1 425-952-7221

© 2025 Appen Limited all rights reserved.