Multilingual Bibliographic Structure

OCLC Research Library Partnership


The Multilingual Bibliographic Structure activity is designed to leverage the multilingual content of WorldCat, the world's largest network of library content and services, so that bibliographic information can be presented in the preferred language and script of the user. This project is experimenting with mining the data from all associated bibliographic records within a workset, tagging the language of each textual data element and making it possible to construct the display from all available data in the workset in the preferred language and script of the user.

OCLC is also generating work-translation ("expression level") records—including the translated title and translator with links to the original work and the author—and adding them to VIAF (Virtual International Authority File), flagged as "xR".

Outputs

Impact

Identifying the records representing translations will enable presenting a work in the user's preferred language, where available. This work will also enable us to gain a better understanding of the extent to which information is shared across cultures, e.g., the percentage of non-English works representing translations of English works, and vice versa.

Work-translation xR records continue to be added to VIAF.

As of May 2014, more than one million xR records have been added to VIAF, representing:

  • 401,000 works
  • 676,000 translations

310,000 of the translations have translators associated with them.

The number of "expressions" for certain popular works has increased substantially. For example, Jane Austen's Pride and Prejudice now has 50 translations represented compared to 13 before the xR records were added; Sense and Sensibility increased from 13 to 35.

The number of works associated with popular Japanese mystery writer Higashino Keigo increased from 13 to 47, and translations of his works into Chinese and Korean are represented in VIAF for the first time.

Background

More than half of the 300 million bibliographic records in WorldCat represent resources in languages other than English. These records are clustered together in worksets. Each workset may include multiple bibliographic records for the same title with data elements represented in different languages of cataloging, that is, the language of the metadata used to describe the resource. This metadata, such as notes and subject headings, is supplied by catalogers rather than transcribed from the resource. Since OCLC member institutions span the world, WorldCat records include many different languages of cataloging. For any one resource, there may be multiple records with summaries, subject headings and notes in various languages and scripts.

Although WorldCat.org offers interfaces in different languages and scripts, the displayed bibliographic content is based on only one record from the workset. Bibliographic records vary in their detail and coding accuracy. Given the years of cooperation among libraries, it is common to see "hybrid" records where information has been added in different languages, e.g., subject headings in multiple languages. The combined information within a workset can be very rich, but it is not fully exploited because WorldCat.org currently presents content from only a single record. Thus, even if a user has selected a preferred interface language, the WorldCat records displayed within the interface are unlikely to be in that user's preferred language and script.
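The idea of building a display from every record in a workset, rather than from a single record, can be illustrated with a minimal sketch. It assumes that each data element has already been tagged with a language and script; the Element type, the field names and the build_display function are hypothetical names chosen for illustration and do not describe OCLC's actual data model or ranking rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One language-tagged data element (hypothetical structure)."""
    field: str   # e.g. "title", "summary", "subject"
    text: str
    lang: str    # language code of the text, e.g. "en", "ja"
    script: str  # script code of the text, e.g. "Latn", "Jpan"


def build_display(workset, lang, script, fallback_lang="en"):
    """For each field, keep the elements that best match the user's preference.

    `workset` is a list of records; each record is a list of Element objects.
    The ranking (exact match, same language, fallback language, anything else)
    is an assumption made for this sketch.
    """
    def rank(el):
        if el.lang == lang and el.script == script:
            return 0
        if el.lang == lang:
            return 1
        if el.lang == fallback_lang:
            return 2
        return 3

    # Pool the elements of every record in the workset, grouped by field.
    by_field = {}
    for record in workset:
        for el in record:
            by_field.setdefault(el.field, []).append(el)

    display = {}
    for field, elements in by_field.items():
        best = min(rank(el) for el in elements)
        texts = []
        for el in elements:
            # Keep only the best-ranked texts, dropping exact duplicates.
            if rank(el) == best and el.text not in texts:
                texts.append(el.text)
        display[field] = texts
    return display


workset = [
    [Element("title", "Pride and prejudice", "en", "Latn"),
     Element("subject", "England -- Fiction", "en", "Latn")],
    [Element("title", "高慢と偏見", "ja", "Jpan")],
]
print(build_display(workset, lang="ja", script="Jpan"))
```

In this sketch a user who prefers Japanese sees the Japanese title from one record and, because no Japanese subject heading exists in the workset, the English subject heading from another record.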

A related component of this activity focuses on translations, the means by which the most valued corpus of the world's cultural and knowledge heritage is shared. The Multilingual Bibliographic Structure project has mined WorldCat's bibliographic records for translated works, with the goal of improving work clustering, presentation and linked data representations, and of contributing generally to global knowledge. Existing worksets in WorldCat are being expanded by associating all possible translations with the original work and identifying which record represents the language and script of the original work. Since a work may have several translations in the same language, we are also parsing the WorldCat records to identify translators. OCLC is generating work-translation ("expression level") records, which include the translated title and translator with links to the original work and the author, and adding them to VIAF™ (Virtual International Authority File), flagged as "xR". At the same time, OCLC is marking up these generated VIAF records using linked data schema so that the relationship of each work with its associated translations and translators can be shared in the Semantic Web.

Details

A small group of authors is responsible for the translated works with the most editions and the most holdings in WorldCat. Only one million persons are associated with titles in more than one language; only 7,000 names are associated with titles in 10 or more languages. Focusing on this relatively small group of titles is expected to have the most impact on users looking for well-known works in a given language.

The effort to associate translations with the original works includes overcoming variations in cataloging practices such as: titles with (or missing) subtitles; different forms of uniform title or missing uniform titles; different transliteration schemes, or errors in the transliteration; inverted titles; and incorrect coding of the language of the work.
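Some of these variations can be reduced by comparing titles on a normalized key rather than on the raw string. The following is a minimal sketch of that general technique, not OCLC's matching algorithm: the title_key function and the ARTICLES list are hypothetical, the handling is tuned to Latin-script titles, and it addresses only subtitles, inverted leading articles, diacritics, case and punctuation. Differences between transliteration schemes and miscoded languages would need separate treatment.

```python
import re
import unicodedata

# Illustrative list of leading articles; a real list would be language-aware.
ARTICLES = {"the", "a", "an", "le", "la", "les", "der", "die", "das", "el", "los", "las"}


def title_key(title: str) -> str:
    """Return a normalized key for comparing titles (Latin-script oriented)."""
    # Keep only the main title: drop everything after the first colon.
    main = title.split(":", 1)[0].strip()

    # Un-invert titles such as "Misérables, Les" -> "Les Misérables".
    inverted = re.fullmatch(r"(.*),\s*(\w+)", main)
    if inverted and inverted.group(2).casefold() in ARTICLES:
        main = f"{inverted.group(2)} {inverted.group(1)}"

    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", main)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # Ignore case and punctuation, and collapse whitespace.
    words = re.sub(r"[^\w\s]", " ", stripped.casefold()).split()

    # Drop a leading article so "The ..." and "..." compare equal.
    if words and words[0] in ARTICLES:
        words = words[1:]
    return " ".join(words)


print(title_key("Pride and prejudice : a novel"))
print(title_key("Misérables, Les") == title_key("Les Misérables"))
```

Records whose keys agree are only candidates for the same work; a key this coarse will also group unrelated works that happen to share a short title, so other evidence such as the author would still be needed.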

Many records for translations do not have an added entry for the translator; records that do have added entries often lack a role designator (neither a $4 nor a $e) to indicate that the entry is for a translator. OCLC therefore parsed records' statements of responsibility (the 245 $c) for strings in different languages that mean "translator".
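Parsing a statement of responsibility for translator phrases can be sketched as follows. This is a simplified illustration under stated assumptions, not OCLC's parser: the cue phrases in TRANSLATOR_CUES are a tiny illustrative sample, the find_translators function is hypothetical, and the personal names in the examples are invented placeholders.

```python
import re

# Illustrative cue phrases only; a real list would cover far more
# languages, scripts and grammatical forms.
TRANSLATOR_CUES = [
    r"translated (?:from the \w+ )?by",  # English
    r"traduit (?:de l'\w+ )?par",        # French
    r"übersetzt von",                    # German
    r"traducción de",                    # Spanish
]

# A cue phrase followed by the name, which runs up to the next
# ISBD separator (";" or "/") or bracket, or to the end of the string.
_PATTERN = re.compile(
    rf"(?:{'|'.join(TRANSLATOR_CUES)})\s+(?P<name>[^;/\[\]]+)",
    re.IGNORECASE,
)


def find_translators(statement: str) -> list[str]:
    """Return the name strings that follow a translator cue in a 245 $c value."""
    names = []
    for match in _PATTERN.finditer(statement):
        name = match.group("name").strip(" .,")
        if name:
            names.append(name)
    return names


# The personal names below are invented placeholders.
print(find_translators("Jane Austen ; translated by A. Translator."))
print(find_translators("Jane Austen ; übersetzt von B. Beispiel."))
```

The sketch returns whatever text follows the cue, so a statement naming two translators joined by "and" comes back as a single string that would need further splitting.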

 

Most recent updates: Page content: 2014-06-02

This activity is a part of the Metadata Management theme.
