Helping Improve Machine Translation for Patent Documents
August 4, 2011
WIPO is pleased to release to the scientific and R&D community a new linguistic data product, which will contribute to improving the quality of machine translation systems for patent documents.
The PATENTSCOPE Corpus of Parallel Patent Applications (Coppa) uses data from WIPO’s international PATENTSCOPE database of patent documents to provide a bilingual “corpus” consisting of more than 8 million parallel segments of text in English and French, covering over 170 million words. Technical details can be found here. Other language pairs will be added in the future if the associated source data become available to WIPO in sufficient volume with the required redistribution rights.
The availability of this vast corpus in a user-friendly format will contribute significantly to efforts aimed at building more accurate machine translation systems for patent texts. Better machine translation systems will, in turn, lower the linguistic barriers for inventors and for patent offices. Ultimately, more accurate machine translation will improve the efficiency of the international patent system, as well as access to the global repository of technological information contained within it.
The parallel segments were obtained by breaking down the abstracts and titles of twenty years’ worth of PCT international patent applications (from 1990 to 2010) into sentences, and mapping these sentences onto their translated versions, which were produced by specialist patent translation professionals. The resulting product is a treasure trove for linguistic research, in particular for terminology extraction, translation memory building and machine translation research.
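As a rough illustration of the sentence splitting and mapping described above, the following minimal Python sketch splits an English text and its French translation into sentences and pairs them by position. It is not WIPO's actual procedure: the function names `split_sentences` and `align_segments`, the naive punctuation-based splitting rule, and the position-based pairing are all assumptions made for this example.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace.

    Assumption: real patent text needs a more robust splitter
    (abbreviations, decimals, and so on); this is illustrative only.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def align_segments(source_text, target_text):
    """Pair source and target sentences by position.

    Hypothetical rule: if the two sides do not contain the same number
    of sentences, return no pairs rather than risk a misalignment.
    """
    src = split_sentences(source_text)
    tgt = split_sentences(target_text)
    if len(src) != len(tgt):
        return []
    return list(zip(src, tgt))


if __name__ == "__main__":
    english = "A device is disclosed. It comprises a sensor."
    french = "Un dispositif est décrit. Il comprend un capteur."
    for en, fr in align_segments(english, french):
        print(en, "|", fr)
```

Discarding texts whose sentence counts differ is a deliberately conservative choice for this sketch: it keeps fewer pairs, but avoids pairing a sentence with the wrong translation.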
WIPO is making the Corpus available free of charge to academic and private research institutions wishing to use it for research purposes only. In return, these institutions commit to sharing the published results with WIPO. For other parties wishing to use the product for non-academic research purposes, it is available for CHF 2,000, and is subject to a no-redistribution policy.