We recently discussed some internationalization (i18n) issues via IRC, and I want to summarize what we talked about.
The main point we discussed was the difference between regional titles, text translations, and text transliterations, and how to implement them in our data model. Regional titles mean that a game or platform is released under different names in different regions. A translation means bringing a text from one language to another, while a transliteration means bringing a text from one script to another without changing the language.
Let's start with an example to illustrate these issues: the game "Secret of Mana", which is the US release of the Japanese original "Seiken Densetsu 2".
As we can see in the English entry for the game at Wikipedia, "Secret of Mana" is not the translation of "Seiken Densetsu 2". These are two different regional titles, which leads us to the following scheme for the game's titles:
Release Name (Region) ||| English language ||| Japanese language ||| Latin script ||| Japanese script
Secret of Mana (USA) ||| Secret of Mana ||| マナの秘密 ||| Secret of Mana ||| シークリット・オブ・マーナー
聖剣伝説2 (Japan) ||| Legend of the Sacred Sword 2 ||| 聖剣伝説2 ||| Seiken Densetsu 2 ||| 聖剣伝説2
As you can see here, we have two regional titles for the game. Both can be translated into every language imaginable, and both can be transliterated into each of the eight scripts that are, in our humble opinion, important for a video game database.
The first thing to note is that every text, whether it is a person's name, game title, game description, screenshot caption, or whatever, is written in a certain script and in a certain language. The separation between script and language is very important, so let's take another look at the two titles of Secret of Mana to make this distinction clear:
String (Script, Language)
聖剣伝説2 (Japanese, Japanese)
Legend of the Sacred Sword 2 (Latin, English)
Seiken Densetsu 2 (Latin, Japanese)
Secret of Mana (Latin, English)
マナの秘密 (Japanese, Japanese)
シークリット・オブ・マーナー (Japanese, English)
For "normal" texts like a game description, a transliteration probably won't be needed. One could think that for personal or geographical names only the transliteration (and thus the script attribute) is needed, but that's not true. The "translation" to another language is important here, too. Let's take a look at two more examples of this:
String (Script, Language)
Михаил Сергеевич Горбачёв (Cyrillic, Russian)
Mikhail Sergeyevich Gorbachev (Latin, English)
Michail Sergeevič Gorbačёv (Latin, Russian)
Michail Sergejewitsch Gorbatschow (Latin, German)
東京 (Japanese, Japanese)
Tokyo (Latin, English)
Tōkyō (Latin, Japanese)
Tokio (Latin, German)
When I talk about "translation" here, I don't mean transferring the meaning of the name to the other language (Tokyo would then be "Eastern Capital" in English), but using the official spelling of that other language.
So, technically, every text object of our database can be defined as a "meta object" consisting of n strings, each with the two attributes (script, language) assigned to it.
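To make this a bit more concrete, here is a minimal Python sketch of such a meta text object. The names TextString, MetaText, and find are made up for illustration and are not taken from any existing Oregami code; script and language are plain strings here, although a real model would probably use standardized codes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class TextString:
    """One concrete string with its two attributes (hypothetical name)."""
    value: str
    script: str
    language: str


@dataclass
class MetaText:
    """A meta text object: n strings that all represent the same text."""
    strings: List[TextString] = field(default_factory=list)

    def add(self, value: str, script: str, language: str) -> None:
        self.strings.append(TextString(value, script, language))

    def find(self, script: Optional[str] = None,
             language: Optional[str] = None) -> List[TextString]:
        """Return all strings matching the given attributes (None = any)."""
        return [
            s for s in self.strings
            if (script is None or s.script == script)
            and (language is None or s.language == language)
        ]


# The Japanese title of Secret of Mana as one meta text object.
seiken = MetaText()
seiken.add("聖剣伝説2", "Japanese", "Japanese")
seiken.add("Legend of the Sacred Sword 2", "Latin", "English")
seiken.add("Seiken Densetsu 2", "Latin", "Japanese")

latin_japanese = seiken.find(script="Latin", language="Japanese")
all_latin = seiken.find(script="Latin")
```

Note that the object itself does not say which string is the "right" one. It only stores the strings and lets the caller filter by script and language.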
The next problem is that we will have to pick one (or more) of these strings for display every time our meta text object is used. Some strings are needed for our video game documentation, some are only informal, and some are necessary to make a game easier to find within the database. So which one to pick, and how? Let's revisit the Secret of Mana example with this in mind:
String (Script, Language)
聖剣伝説2 (Japanese, Japanese)
Secret of Mana (Latin, English)
These are official release titles of the game for a certain region, so they need to be assigned to all releases using them.
Legend of the Sacred Sword 2 (Latin, English)
マナの秘密 (Japanese, Japanese)
These are only informal translations, which could be shown to users when hovering over the other language's title.
Seiken Densetsu 2 (Latin, Japanese)
シークリット・オブ・マーナー (Japanese, English)
These are transliterations of the official release titles, and therefore more important than the informal translations above. For example, "Seiken Densetsu 2" is needed for users of the Latin script searching for Japanese games, or for game lists in Latin script.
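The three groups above could be recorded as a role on each string. The following is only a rough Python sketch under my own assumptions: the Role enum, its member names, and the numeric ordering are invented here, with the ordering chosen so that transliterations rank above informal translations as described above.

```python
from enum import IntEnum
from typing import List, NamedTuple


class Role(IntEnum):
    """Hypothetical roles; a lower number means more important."""
    OFFICIAL = 0              # official release title for a region
    TRANSLITERATION = 1       # official title in another script
    INFORMAL_TRANSLATION = 2  # unofficial translation, e.g. for hover text


class TitleString(NamedTuple):
    value: str
    script: str
    language: str
    role: Role


# The strings belonging to the Japanese release title.
strings = [
    TitleString("Legend of the Sacred Sword 2", "Latin", "English",
                Role.INFORMAL_TRANSLATION),
    TitleString("Seiken Densetsu 2", "Latin", "Japanese",
                Role.TRANSLITERATION),
    TitleString("聖剣伝説2", "Japanese", "Japanese", Role.OFFICIAL),
]


def by_importance(items: List[TitleString]) -> List[TitleString]:
    """Sort strings from most to least important according to their role."""
    return sorted(items, key=lambda s: s.role)


ranked = by_importance(strings)
```

Such a role only describes what kind of string we are dealing with. Whether it should also decide which string gets displayed is exactly the open question.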
So, having said that, should we label some strings as "leading" or "important" in the meta text object to begin with, so those will show up when no other context is specified? Or shouldn't we do this, leaving the meta text object unaware of its content's importance, thus having to provide context every time we use the text object?
Not really sure, but my gut feeling is that labeling one string as "leading" is too inflexible to solve future problems. So I'd go for the solution of always using a text object in context, and thus manually picking the right string from its contents based on that context.
For example, if we connect the above text object
聖剣伝説2 (Japanese, Japanese)
Legend of the Sacred Sword 2 (Latin, English)
Seiken Densetsu 2 (Latin, Japanese)
as the Japanese release title to the respective game, we need to specify that "聖剣伝説2" is the actual string used for this release. On the other hand, if a user of the Latin script requests a list of all SNES games released in Japan, we would need to pick "Seiken Densetsu 2" as the string to show for this list.
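A possible way to express this context-based picking, again only as a Python sketch: the class ReleaseTitle and its methods are hypothetical, and the rule "prefer a string in the user's script that keeps the release's language, else fall back to the official string" is my assumption of how the list case could work.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TextString:
    value: str
    script: str
    language: str


@dataclass
class ReleaseTitle:
    """Connects a text object to a release and names the string actually used."""
    candidates: List[TextString]  # all strings of the text object
    official: TextString          # the string used for this release

    def for_release_page(self) -> str:
        # Context "release": always show the string used on the release itself.
        return self.official.value

    def for_list(self, user_script: str) -> str:
        # Context "list": prefer the user's script, keeping the release language.
        if self.official.script == user_script:
            return self.official.value
        for s in self.candidates:
            if s.script == user_script and s.language == self.official.language:
                return s.value
        # Assumed fallback: no transliteration available, show the official string.
        return self.official.value


kanji = TextString("聖剣伝説2", "Japanese", "Japanese")
informal = TextString("Legend of the Sacred Sword 2", "Latin", "English")
romaji = TextString("Seiken Densetsu 2", "Latin", "Japanese")

title = ReleaseTitle(candidates=[kanji, informal, romaji], official=kanji)

release_view = title.for_release_page()
latin_list_view = title.for_list("Latin")
```

Here the text object stays unaware of any "leading" string. The connection to the release carries that information, and every other use has to state its own context.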
So much for some basics of this complex issue, but there's one important question about i18n we haven't touched yet: how do we handle the different language versions of Oregami? While it may be rather easy to translate the (static) UI and help into another language, I am mainly talking about the textual content (descriptions, screenshot captions, etc.), i.e. the data.
I think we will be well advised to only start a new data language once this language's community has grown to a critical mass of native contributors / approvers. But which way should we go once we have started more languages besides English? I see two basic ways:
1) The Wikipedia way: every language grows alone, more or less based on common standards. The quality of the texts may differ severely from one language to another, nonetheless.
2) English is the central language, so every other language's text is translated from and to it, and common standards apply strictly. The quality level is comparable in every language.
Details need to be worked out, but which way do you prefer?