Software/Data

Software

NGT
big3store
AnnexML
yskip
japanese-large-lm
japanese-large-lm-instruction-sft
LINE DistilBERT Japanese
Meta AI Video Similarity Challenge - 3rd Place Solution
Tappy

Data

VFD Dataset (Japanese)
Yahoo! Chiebukuro Data
Yahoo! Search Query Data
YJ Captions Dataset
YJ Chat Detection Dataset
Japanese Visual Genome VQA Dataset
Visual Scenes with Utterances Dataset
Experimental Dataset for Post-Ensemble Methods
Yahoo! Bousai Crowd Data
JGLUE: Japanese General Language Understanding Evaluation
YJ Covid-19 Prediction Data
LibriTTS-P
YJ AmbigDialogue

Software

NGT（Neighborhood Graph and Tree for Indexing）

Overview

The software being provided performs high-speed searches for data near to the vectors indicated as queries from among large volumes of high dimensional vector data.

How to get

Software Download (external link)
big3store

Overview

This is a prototype system for storing and searching large-scale knowledge graph data efficiently. The scalability of storage system and query processing system towards Peta triples is currently possible by using large-scale distribution of data into shared-nothing clusters.

How to get

Software Download (external link)
AnnexML: Approximate Nearest Neighbor Search for Extreme Multi-Label Classification

Overview

AnnexML is a multi-label classifier designed for extremely large label space (10^4 to 10^6). The prediction of AnnexML is efficiently performed by using an approximate nearest neighbor search method.

How to get

Software Download (external link)
yskip: Incremental Skip-gram Model with Negative Sampling

Overview

C++ implementation of incremental algorithm for learning skip-gram model with negative sampling.

Paper : Incremental Skip-gram Model with Negative Sampling(external link)

How to get

Software Download (external link)
japanese-large-lm

Overview

A Japanese language model with 3.6 billion and 1.7 billion parameters, trained and published by LINE.
For more information, please see our blog post (only in Japanese): https://engineering.linecorp.com/ja/blog/3.6-billion-parameter-japanese-language-model

How to get

Software Download (3.6b model, external link)
Software Download (1.7b model, external link)
japanese-large-lm-instruction-sft

Overview

A model of japanese-large-lm with improved interaction performance using the Instruction Tuning technique.
For more information, please see our blog post (only in Japanese): https://engineering.linecorp.com/ja/blog/3.6b-japanese-language-model-with-improved-dialog-performance-by-instruction-tuning

How to get

Software Download (external link)
LINE DistilBERT Japanese

Overview

A high-performance, fast, and lightweight Japanese language model. It outperforms existing Japanese DistilBERTs such as Laboro DistilBERT and BandaiNamco DistilBERT on all JGLUE tasks, the benchmark for Japanese natural language understanding.
For more information, please see our blog post (only in Japanese): https://engineering.linecorp.com/ja/blog/line-distilbert-high-performance-fast-lightweight-japanese-language-model

How to get

Software Download (external link)
Meta AI Video Similarity Challenge - 3rd Place Solution

Overview

The solution code for the third place winner in the Descriptor Track / Matching Track of the Meta AI Video Similarity Challenge, a competition for video copy detection accuracy.

How to get

Software Download (external link)
Tappy

Overview

Analyze the sizes of buttons, links, and other elements on web pages, and output the tap success rate.

How to get

Software Download (external link)

Data

VFD Dataset (Japanese)

Overview

We propose a visually-grounded first-person dialogue (VFD) dataset with verbal and non-verbal responses. The VFD dataset provides manually annotated (1) first-person images of agents, (2) utterances of human speakers, (3) eye-gaze locations of the speakers, and (4) the agents' verbal and non-verbal responses. All utterances and responses are represented in Japanese. The images with eye-gaze locations are available at GazeFollow (MIT).

How to get

Data Download (external link)
Yahoo! Chiebukuro Data (Ver. 3)

Overview

Yahoo! Chiebukuro is the largest community-driven question answering service in Japan. It connects users with questions to those users who may have the answer, enabling people to share information and knowledge with each other. The data being provided consists of resolved questions and answers extracted from the Chiebukuro database for the period as below.

Period :April 2016 –March 2019
Number of Questions :about 2.6 million
Number of Answers :about 6.7 million

How to get

This data is available for download through the National Institute of Informatics (NII) (external link) homepage. Please refer to the NII’s Yahoo! Chiebukuro Data (Ver. 3) Usage Procedures page (external link) for details regarding applying for and using the data.
Yahoo! Search Query Data

Overview

The data is composed of a set of related queries to the topic queries of the 12th NTCIR (NTCIR-12) tasks. By using three different techniques, related queries were extracted from search logs of Yahoo! Search for the period as below. The data does not contain any personal information such as operation history, personal identifiers and context.

Period :July 2009 – June 2013

How to get

This data is provided to NTCIR (NII Testbeds and Community for Information access Research) (external link) Evaluation of Information Access Technologies Workshop participants, and can be used for free by research groups taking part in the workshop.
For details, please check the NTCIR (external link) web page.
※ Applications to participate in the task that will use the data provided by Yahoo! JAPAN are no longer being accepted.
YJ Captions Dataset

Overview

We have developed a Japanese version of the MS COCO caption dataset (external link), which we call YJ Captions 26k Dataset. It is created to facilitate the development of image captioning in Japanese language. Each Japanese caption describes the specified image provided in MS COCO dataset and each image has 5 captions.

How to get

Data Download (external link)
YJ Chat Detection Dataset

Overview

This is the chat detection dataset introduced in (Akasaki and Kaji ACL 2017) (external link).

How to get

The dataset is available for research purposes only. Please fill in Application for Use of Yahoo’s Speech Transcription Data on Chat Detection Study and send it to ml-lyresearch-data "at" lycorp.co.jp as a pdf file. Qualified applicants include academic or industrial researchers. Students can use the data, but are not qualified as applicants.
Japanese Visual Genome VQA Dataset

Overview

We have created a Japanese visual question answering (VQA) dataset by using Yahoo! Crowdsourcing, based on the images from the Visual Genome dataset(external link). Our dataset is meant to be comparable to the freeform QA part of Visual Genome dataset. The dataset consists of 99,208 images, together with 793,664 QA pairs in Japanese with every image having eight QA pairs.

How to get

Data Download (external link)
Visual Scenes with Utterances Dataset

Overview

With the widespread use of intelligent systems, more and more people expect such systems to understand complicated social scenes. To facilitate development of intelligent systems, we created a mock dataset called Visual Scenes with Utterances (VSU) that contains a vast body of image variations in visual scenes with an annotated utterance and a corresponding addressee. Our dataset is based on images and annotations from the GazeFollow dataset (Recasens et al., 2015). The GazeFollow dataset consists of (1) the original image, (2) cropped speaker image with head location annotated, and (3) gaze. To create our dataset, we further annotated (4) utterances in texts, and (5) to whom an utterance is addressed. The images are available at http://gazefollow.csail.mit.edu/ .

How to get

Data Download (external link)
Experimental Dataset for Post-Ensemble Methods

Overview

This is the dataset including 128 summarization models and their outputs used for comparing post-ensemble methods in the following paper.

Paper :Frustratingly Easy Model Ensemble for Abstractive Summarization (EMNLP 2018)

How to get

Data Download
Yahoo! Bousai Crowd Data

Overview

This data is the "Yahoo! Bousai Crowd Data" (data representing urban dynamic analyzed from Yahoo! JAPAN disaster application) used in the following paper.

Paper :DeepCrowd: A Deep Model for Large-Scale Citywide Crowd Density and Flow Prediction (IEEE　TKDE)
Period :From April 1st, 2017 to July 9th, 2017 （100days）
Area :Tokyo, Osaka
Mesh Size :about 450m grid
* score is normalized, and follows k-anonymity.

How to get
The dataset is available for research purposes only.
Please fill in Application for Use of Yahoo! Bousai Crowd Data and send it to ml-lyresearch-data "at" lycorp.co.jp as a pdf file. The fields below the underline on the application form do not need to be filled in.
Qualified applicants include academic or industrial researchers.
Please ask your supervisor or other responsible teacher to apply, since students are not qualified as applicants.
JGLUE: Japanese General Language Understanding Evaluation

Overview

This data is a Japanese language comprehension benchmark that can be used for training and evaluating models. It includes a text classification task, a sentence pair classification task, and a question answering task. JGLUE has been constructed in a joint research project between Yahoo Japan Corporation and Kawahara Lab at Waseda University.

How to get

Data Download (external link)
YJ Covid-19 Prediction Data

Overview

This data is the “YJ Covid-19 Prediction Data” used in the following paper.

Paper：Multiwave COVID-19 Prediction from Social Awareness using Web Search and Mobility Data (KDD2022)

mobility data
Period: From February 2020 to June, 2021
Area: 23 wards of Tokyo only

search data
Period: From February 2020 to June, 2021
Query: Covid-19 Symptoms query (44 queries written in the paper)

How to get
The dataset is available for research purposes only.
Please fill in Application for Use of “YJ Covid-19 Prediction Data” and send it to ml-lyresearch-data "at" lycorp.co.jp as a pdf file.
Qualified applicants include academic or industrial researchers. Please ask your supervisor or other responsible teacher to apply, since students are not qualified as applicants.
LibriTTS-P

Overview

This dataset comes from the paper "LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning," accepted at INTERSPEECH 2024. LibriTTS-P adds prompts describing speaking styles and speaker Identity to the publicly available LibriTTS-R dataset, which has 585 hours of speech. The LibriTTS-P includes 373,868 prompts for 2,443 speakers.

How to get

Data Download (external link)
YJ AmbigDialogue

Overview

This is the ambiguous utterance detection dataset introduced in (Akasaki and Sassano EMNLP 2024) (external link).

How to get

Data Download (external link)

Software/Data

Software

Data

Software

NGT（Neighborhood Graph and Tree for Indexing）

Overview

How to get

big3store

Overview

How to get

AnnexML: Approximate Nearest Neighbor Search for Extreme Multi-Label Classification

Overview

How to get

yskip: Incremental Skip-gram Model with Negative Sampling

Overview

How to get

japanese-large-lm

Overview

How to get

japanese-large-lm-instruction-sft

Overview

How to get

LINE DistilBERT Japanese

Overview

How to get

Meta AI Video Similarity Challenge - 3rd Place Solution

Overview

How to get

Tappy

Overview

How to get

Data

VFD Dataset (Japanese)

Overview

How to get

Yahoo! Chiebukuro Data (Ver. 3)

Overview

How to get

Yahoo! Search Query Data

Overview

How to get

YJ Captions Dataset

Overview

How to get

YJ Chat Detection Dataset

Overview

How to get

Japanese Visual Genome VQA Dataset

Overview

How to get

Visual Scenes with Utterances Dataset

Overview

How to get

Experimental Dataset for Post-Ensemble Methods

Overview

How to get

Yahoo! Bousai Crowd Data

Overview

How to get

JGLUE: Japanese General Language Understanding Evaluation

Overview

How to get

YJ Covid-19 Prediction Data

Overview

How to get

LibriTTS-P

Overview

How to get

YJ AmbigDialogue

Overview

How to get