• CCMatrix, described in [Schwenk et al., 2019]: 13M web-crawled sentences. The raw corpus, CCNet, is available, but the filtering criteria that need to be applied to obtain CCMatrix have not yet been released.
Given the CPU time required to run the full pipeline on such a large corpus, we share a mapping from URL to the information we computed. You can reconstruct the corpus used in the paper by using:
The total processing time is about 9 hours using 5000 CPU cores for one snapshot, i.e. roughly 45,000 CPU core-hours.
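To illustrate why a shared URL-to-information mapping avoids rerunning the expensive pipeline, the following is a minimal sketch of how such a mapping could be joined back onto raw documents. It is not the released tool: the function name `reconstruct` and the field names `url`, `text`, `language`, and `perplexity` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Iterator


def reconstruct(
    raw_docs: Iterable[Dict[str, object]],
    url_to_info: Dict[str, Dict[str, object]],
) -> Iterator[Dict[str, object]]:
    """Attach precomputed information to raw documents, keyed by URL.

    Hypothetical sketch: documents whose URL is absent from the mapping
    are assumed to have been filtered out and are skipped.
    """
    for doc in raw_docs:
        info = url_to_info.get(doc["url"])
        if info is None:
            continue  # no precomputed entry, so the document is dropped
        # Merge the raw fields with the precomputed ones.
        yield {**doc, **info}


if __name__ == "__main__":
    # Toy data standing in for a raw crawl and the shared mapping.
    raw = [
        {"url": "http://example.com/a", "text": "first document"},
        {"url": "http://example.com/b", "text": "second document"},
    ]
    mapping = {
        "http://example.com/a": {"language": "en", "perplexity": 123.4},
    }
    for record in reconstruct(raw, mapping):
        print(record)
```

Under these assumptions, reconstruction becomes a single lookup per document, so the cost is dominated by reading the raw corpus rather than by recomputing the per-document information.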