Typesetting Japanese with Omega

For busy people

To typeset Japanese text with Omega, just do the following (I assume you are on a Unix machine, with Perl 5.6.0 or later, connected to the Internet).

wget ftp://ftp.netscape.com/pub/communicator/extras/fonts/windows/Cyberbit.ZIP
unzip Cyberbit.ZIP
wget http://www.math.jussieu.fr/~zoonek/LaTeX/FontInstallUnicode/font_install_unicode.pl
perl font_install_unicode.pl

wget http://www.math.jussieu.fr/~zoonek/LaTeX/Omega-Japanese/Omega-Japanese-0.004.tar.gz
cd 0_tmp_cyberbit
tar zxvf ../Omega-Japanese-0.004.tar.gz
make

Then adapt sample-article.tex or sample-book.tex to suit your needs. These files may be compiled and viewed as follows (remember to use odvips, not dvips).

lambda sample.tex
odvips -o sample.ps sample.dvi
gv sample.ps

Here are the sample files

http://www.math.jussieu.fr/~zoonek/LaTeX/Omega-Japanese/sample.ps.gz
http://www.math.jussieu.fr/~zoonek/LaTeX/Omega-Japanese/sample-article.ps.gz
http://www.math.jussieu.fr/~zoonek/LaTeX/Omega-Japanese/sample-book.ps.gz

Fonts

A few weeks ago, I wrote a small Perl script to install TrueType Unicode fonts for use with Omega. Just run it in a directory containing such a font, and you are done. The details are here:

http://www.math.jussieu.fr/~zoonek/LaTeX/FontInstallUnicode/

OTP

An OTP is a program, written either in a dedicated language (an internal OTP) or in your favorite programming language (an external OTP), that transforms a stream of characters, for instance to change their encoding, to add ligatures, or to insert various information between the characters. We shall use OTPs to remove unneeded spaces and to tell Omega where it is allowed to break lines.

An OCP is merely a compiled OTP: one first writes an OTP, then compiles it into an OCP,

otp2ocp myotp

or

mkocp myotp.ocp

and Omega reads and executes the OCP.

We shall use these OTPs in the following way. First, we must tell Omega what encoding we use in our file. Here, I choose UTF8, so I add, at the beginning of the file

\ocp\OCPutf=inutf8
\InputTranslation currentfile \OCPutf

If you are not using UTF8, replace "inutf8" by the correct OTP, such as "injis", "insjis" or "ineucjp":

http://itohws03.ee.noda.sut.ac.jp/~matsuda/omega-j/

This first OTP will be called before Omega does anything. The other OTPs will be called after it has done most of its work. In the following example, we define an OCP list composed of three OTPs (one of them is even an external OTP).

\ocp\OCPjpspaces=JapaneseStripUnneededSpaces
\ocp\OCPjphyph=JapaneseLineBreaking
\externalocp\OCPdebug=debug.pl {}
\ocplist\JapaneseOCP=
  \addbeforeocplist 1 \OCPjpspaces
  \addbeforeocplist 1 \OCPdebug
  \addbeforeocplist 1 \OCPjphyph
\nullocplist

We must then use this OCP list.

\pushocplist\JapaneseOCP
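
If Japanese appears only in parts of the document, the list can also be confined to a group or environment with \popocplist (if I remember the Omega primitives correctly, \popocplist takes no argument and pops the topmost list; the environment name below is my own):

\newenvironment{japanese}
  {\pushocplist\JapaneseOCP}
  {\popocplist}

Within \begin{japanese}...\end{japanese}, the three OTPs are active; outside, the text is processed as usual.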

External OTP

External OTPs are OTPs written in any programming language (C, Perl, Ruby, Python, etc.).

For debugging purposes, I use the following OTP.

#!/usr/bin/perl -w
use strict;
$|++;
print STDERR "OTP DEBUG START\n";
while(<>){
  print STDERR "OTP DEBUG: `$_'\n";
  print $_;
}
print STDERR "OTP DEBUG END\n";

Do not forget to make it executable.

chmod +x debug.pl

We have already seen how to put it into an OCP list.

Internal OTP

Internal OTPs are OTPs written in a dedicated language and directly interpreted by Omega.

An OTP file starts with the two following lines, saying that the input and the output are made up of 2-byte characters.

input: 2;
output: 2;

We can then define character classes, such as "spaces", "letters", "punctuation", or anything we may need.

aliases:
SPACE = (@"0020 | @"0009 | @"000A | @"000C | @"000D );
LETTER = (@"0386-@"03F3 | @"1F70-@"1FFC) ;

We can use negation:

NONASIAN = ^( {ASIAN} );

We can use the logical OR:

A = ( {FORBIDDEN_BEFORE} |
      {STRICTLY_FORBIDDEN_BEFORE} 
    );

But there is no logical AND; we can emulate it with De Morgan's laws:

A_AND_B = ^( ^({A}) | ^({B}) );

Then, we have a list of expressions.

% These characters should be left unchanged
@"00-@"9F => \1;

% Changing a single character
@"40      => @"C9;

% Replacing several characters by a single one
`<'`a'    =>    #(@"1F01)  ;
`>'``'`a' =>    #(@"1F02)  ;

So far, we have just taken characters from the input stream and sent other characters to the output stream. But we sometimes want to put some characters back on the input stream, for further processing.

% We remove any duplicate spaces
{SPACE}{SPACE} 
  => 
  <= #(\1);

% lam-meem (not followed by dzeem, hha, kha)
@"E144@"E245^(@"E22C|@"E22D|@"E22E) 
  => @"0183 
  <= \3;

That is all what we shall need, but it is also possible to define and use tables, and to use states -- thus, we can program finite automata.
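
For the record, here is a sketch of both features (written from memory of the Omega documentation -- check the exact syntax there; the table and the markers below are made up for illustration). The table maps the ASCII digits 1-9 to the circled digits U+2460-U+2468, and the state switches processing off between `<' and `>'.

input: 2;
output: 2;

tables:
circled[@"9] = {@"2460, @"2461, @"2462, @"2463, @"2464,
                @"2465, @"2466, @"2467, @"2468};

states:
VERBATIM;

expressions:

% Inside the markers, copy everything unchanged
<VERBATIM>`>'   => `>' <pop:>;
<VERBATIM>.     => \1;
`<'             => `<' <push: VERBATIM>;
% Elsewhere, translate the digits through the table
@"0031-@"0039   => #(circled[\1-@"0031]);
.               => \1;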

Why several OTP?

I had first tried to put everything (removal of spaces and line-breaking rules) into a single OTP. But it is much easier to write and debug if it is decomposed into several small OTPs.

OTP to remove unneeded spaces

When one types a Japanese text, spurious spaces may appear: for instance, the carriage returns at the ends of the lines should not add spaces to the text.

For instance, the following text

ある日の暮方の事である。
一人の下人が、羅生門の下
で雨やみを待っていた。

would be understood as

ある日の暮方の事である。  一人の下人が、羅生門の下  で雨やみを待っていた。

We can first try to remove all the spaces. We shall consider the following characters as spaces to be removed (if I have forgotten some, tell me). We do not remove the ideographic space U+3000.

U+0009 Horizontal tabulation 
U+000A Line Feed 
U+000C Form Feed 
U+000D Carriage Return
U+0020 Space

Here is the OTP

input: 2;
output: 2;
    
aliases:
SPACE = (@"0020 | @"0009 | @"000A | @"000C | @"000D );

expressions:
{SPACE} => ;

This will work. You might fear that it prevents Omega from distinguishing paragraphs, which are separated by a blank line, i.e., by spaces; but this OTP is called after the text has been broken into paragraphs.

Yet, if Latin characters appear inside Japanese text, we may be removing too many spaces. We shall therefore remove only the spaces that appear before or after an Asian character.

To distinguish between Asian and non-Asian characters, I just consult the Unicode code charts: http://www.unicode.org/charts/

We shall consider the following character ranges as Asian (there is a coarser decomposition of the Unicode chart into an A-zone (alphabetic, 0000-33FF) and an I-zone (ideographic, 3400-9FFD), see http://czyborra.com/unicode/characters.html , but hiragana, katakana and Asian punctuation also lie in the A-zone: we have to be more precise).

1100-11FF Hangul Jamo

2E80-2EFF CJK Radicals Supplement
2F00-2FDF Kangxi Radicals
2FF0-2FFF Ideographic Description Characters (?)
3000-303F CJK Symbols and Punctuation
3040-309F Hiragana
30A0-30FF Katakana
3100-312F Bopomofo (?)
3130-318F Hangul Compatibility Jamo
3190-319F Kanbun
31A0-31BF Bopomofo Extended
3200-32FF Enclosed CJK Letters and Months
3300-33FF CJK Compatibility (units)
3400-4DBF CJK Unified Ideographs Extension A
4E00-9FAF CJK Unified Ideographs
A000-A48F Yi Syllables (?)
A490-A4CF Yi Radicals
AC00-D7AF Hangul Syllables

F900-FAFF CJK Compatibility Ideographs

FE30-FE4F CJK Compatibility Forms (vertical punctuation)

FF00-FFEF Halfwidth and Fullwidth Forms (katakana, hangul)

And the rest as non-asian.

Note: On the Unicode Web site, I also find the following ranges:

20000-2A6DF CJK Unified Ideographs Extension B (?)
2F800-2FA1F CJK Compatibility Ideographs Supplement (?)

These code points lie beyond U+FFFF, so they do not fit in the 2-byte characters our OTPs process; I do not know what to do with them, and I will therefore simply ignore them (I hope that is OK; if not, tell me).

Here is the OTP.

input: 2;
output: 2;
    
aliases:
    
SPACE = (@"0020 | @"0009 | @"000A | @"000C | @"000D );
  
ASIAN = (@"1100-@"11FF | @"2E80-@"D7AF | 
         @"F900-@"FAFF | @"FE30-@"FE4F | 
         @"FF00-@"FFEF);
NONASIAN = ^( {ASIAN} );
  
expressions:
  
% We remove any duplicate spaces
{SPACE}{SPACE} => 
               <= #(\1);
  
% We remove any space if the following characters are asian
{SPACE}{ASIAN} =>
               <= #(\2);
  
% We remove any space if the preceding characters are asian
{ASIAN}{SPACE} =>
               <= #(\1);

Japanese line-breaking rules (1)

According to Ken Lunde's book, "CJKV Information Processing" (actually, the library only has an older edition, entitled "Japanese Information Processing", but I guess this part has not changed much), there are three classes of characters before or after which line breaking is forbidden.

Strictly forbidden before:
  punctuation marks
    U+0021 !
    U+002c ,
    U+002e .
    U+003a :
    U+003b ;
    U+003f ?
    U+3001 、
    U+3002 。
    U+ff01 !
    U+ff0c ,
    U+ff0e .
    U+ff1a :
    U+ff1b ;
    U+ff1f ?
    U+ff61 。
  closing brackets and quotes
    U+0029 )
    U+005d ]
    U+007d }
    U+2019 ’
    U+201d ”
    U+3009 〉
    U+300b 》
    U+300d 」
    U+300f 』
    U+3011 】
    U+3015 〕
    U+3017 〗
    U+ff09 )
    U+ff3d ]
    U+ff5d }
    U+ff63 」

Forbidden before:
  katakana lengthening mark
    U+30fc ー
  kanji repetition mark
    U+3005 々
  small kana
    U+3041 ぁ
    U+3043 ぃ
    U+3045 ぅ
    U+3047 ぇ
    U+3049 ぉ
    U+3083 ゃ
    U+3085 ゅ
    U+3087 ょ
    U+3063 っ
    U+308e ゎ
    U+30a1 ァ
    U+30a3 ィ
    U+30a5 ゥ
    U+30a7 ェ
    U+30a9 ォ
    U+30e3 ャ
    U+30e5 ュ
    U+30e7 ョ
    U+30c3 ッ
    U+30ee ヮ
    U+30f5 ヵ
    U+30f6 ヶ

Slightly forbidden before:
    U+000a 
    U+0025 %
    U+002d -
    U+2010 ‐
    U+2212 −
    U+2030 ‰
    U+2032 ′
    U+2033 ″
    U+2103 ℃
    U+309b ゛
    U+309c ゜
    U+309d ゝ
    U+309e ゞ
    U+30fd ヽ
    U+30fe ヾ
    U+ff02 "
    U+ff05 %
    U+ff0d -
    U+ff9e ゙
    U+ff9f ゚

Strictly forbidden after:
  opening brackets and quotes
    U+0028 (
    U+005b [
    U+007b {
    U+2018 ‘
    U+201c “
    U+3008 〈
    U+300a 《
    U+300c 「
    U+300e 『
    U+3010 【
    U+3014 〔
    U+3016 〖
    U+ff08 (
    U+ff3b [
    U+ff5b {
    U+ff62 「

Slightly forbidden after:
    U+ffe5 ¥
    U+00a5 ¥
    U+ff04 $
    U+0024 $
    U+3012 〒
    U+266f ♯
    U+ff03 #
    U+0023 #
    U+ffe0 ¢
    U+00a2 ¢
    U+ffe1 £
    U+00a3 £
    U+ff20 @
    U+0040 @
    U+00a7 §
  (TODO: Is there a large § in Unicode?)

OTP to break lines

To break lines, we shall insert commands between any two characters that will either allow or prevent line breaking. These macros are defined as follows.

\def\CJKunbreakablekernone{%
  \nobreak
  \hskip 0sp plus 2sp minus 2sp
  \nobreak
}
\def\CJKunbreakablekerntwo{%
  \penalty 200
  \hskip 0sp plus 2sp minus 2sp
  \penalty 200
}
\def\CJKunbreakablekernthree{%
  \penalty 100
  \hskip 0sp plus 2sp minus 2sp
  \penalty 100
}
\def\CJKbreakablekern{\hskip 0sp plus 2pt minus 2sp}

Here is the OTP.

input: 2;
output: 2;
    
aliases:
    
STRICTLY_FORBIDDEN_AFTER = (...);
FORBIDDEN_AFTER = ( ... );
SLIGHTLY_FORBIDDEN_AFTER = ( ... );
STRICTLY_FORBIDDEN_BEFORE = ( ... );
FORBIDDEN_BEFORE = ( ... );
SLIGHTLY_FORBIDDEN_BEFORE = ( ... );

ASIAN = (...);
  
ANY = .;
  
expressions:
  
{ANY}{STRICTLY_FORBIDDEN_BEFORE} 
  => #(\1) "\CJKunbreakablekernone "
  <= #(\2);
{STRICTLY_FORBIDDEN_AFTER}{ANY} 
  => #(\1) "\CJKunbreakablekernone "
  <= #(\2);
{ANY}{FORBIDDEN_BEFORE} 
  => #(\1) "\CJKunbreakablekerntwo "
  <= #(\2);
{FORBIDDEN_AFTER}{ANY} 
  => #(\1) "\CJKunbreakablekerntwo "
  <= #(\2);
{ANY}{SLIGHTLY_FORBIDDEN_BEFORE} 
  => #(\1) "\CJKunbreakablekernthree "
  <= #(\2);
{SLIGHTLY_FORBIDDEN_AFTER}{ANY} 
  => #(\1) "\CJKunbreakablekernthree "
  <= #(\2);
{ASIAN}{ANY} 
  => #(\1) "\CJKbreakablekern "
  <= #(\2);
{ANY}{ASIAN} 
  => #(\1) "\CJKbreakablekern "
  <= #(\2);
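
To see what this OTP does, consider the fragment た。一, and recall that 。 is in STRICTLY_FORBIDDEN_BEFORE while all three characters are in ASIAN. The first expression matches た。, outputs た followed by the unbreakable kern, and puts 。 back on the input stream; the expression {ASIAN}{ANY} then matches 。一 and outputs 。 followed by the breakable kern. Omega therefore typesets the stream

た\CJKunbreakablekernone 。\CJKbreakablekern 一

so that a line break is possible after the full stop, but not before it.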

Japanese line-breaking rules (2)

Furthermore, some characters (such as 。 (full stop) or 」 (closing quotes)) are allowed to protrude into the right margin.


Refinement of our line-breaking OTP

We shall implement this with the following macro.

\def\CJKprotrude#1{%
  \discretionary{\rlap{#1}}%
                {}%
                {#1}%
}

There is little change in the OTP:

...

aliases:
...  
NOT_STRICTLY_FORBIDDEN_BEFORE = ^( {STRICTLY_FORBIDDEN_BEFORE} );

expressions:

{ANY}{STRICTLY_FORBIDDEN_BEFORE}{STRICTLY_FORBIDDEN_BEFORE}
  => #(\1) "\CJKunbreakablekernone "
  <= #(\2) #(\3);
{ANY}{STRICTLY_FORBIDDEN_BEFORE}{NOT_STRICTLY_FORBIDDEN_BEFORE}
  => #(\1) "\CJKunbreakablekernone \CJKprotrude "
  <= #(\2) #(\3);
{ANY}{STRICTLY_FORBIDDEN_BEFORE}
  => #(\1) "\CJKunbreakablekernone "
  <= #(\2);
{STRICTLY_FORBIDDEN_AFTER}{ANY} 
  => #(\1) "\CJKunbreakablekernone "
  <= #(\2);
{ANY}{FORBIDDEN_BEFORE} 
  => #(\1) "\CJKunbreakablekerntwo "
  <= #(\2);
{FORBIDDEN_AFTER}{ANY} 
  => #(\1) "\CJKunbreakablekerntwo "
  <= #(\2);
{ANY}{SLIGHTLY_FORBIDDEN_BEFORE} 
  => #(\1) "\CJKunbreakablekernthree "
  <= #(\2);
{SLIGHTLY_FORBIDDEN_AFTER}{ANY} 
  => #(\1) "\CJKunbreakablekernthree "
  <= #(\2);
{ASIAN}{ANY} 
  => #(\1) "\CJKbreakablekern "
  <= #(\2);
{ANY}{ASIAN} 
  => #(\1) "\CJKbreakablekern "
  <= #(\2);

Hyphenation parameters

We shall change the hyphenation parameters in the following way.

% If the badness does not exceed this, no hyphenation is
% attempted.
\pretolerance=-1 % Was 100

% Maximal badness (increase this inside multicols?)
\tolerance=200 % Unchanged

% Penalty added for the first hyphenation
% in the current paragraph
\hyphenpenalty=0 % Was 50

% Penalty added for subsequent hyphenations
\exhyphenpenalty=0 % Was 50

% TeX tries to minimize the demerits of the lines:
% (\linepenalty + badness)^2 + penalty^2
% (minus penalty^2 if the penalty is negative)
\linepenalty=10

% If a tight line is followed by a loose one
% (or conversely), we add \adjdemerits 
% to the demerit
\adjdemerits=0 % Was 10000 % Is it a good value?

% two hyphens on consecutive lines also add 
% to the demerit
\doublehyphendemerits=0 % Was 10000

% A hyphen on the last line also adds to
% the demerit.
\finalhyphendemerits=0 % Was 5000

% Minimum number of characters in the current word
% before or after a hyphenation point
\lefthyphenmin=2 % Unchanged
\righthyphenmin=3 % Unchanged
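
For reference, the exact formula, from The TeXbook: if b is the badness of a line, p the penalty at the break that ends it, and l the value of \linepenalty, the demerits of the line are

d = (l + b)^2 + p^2     if 0 <= p < 10000
d = (l + b)^2 - p^2     if -10000 < p < 0
d = (l + b)^2           if p <= -10000

to which \adjdemerits, \doublehyphendemerits and \finalhyphendemerits are added when applicable.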

Translations

Some elements of the standard LaTeX classes need to be translated into Japanese. Upon looking into jarticle.cls and jbook.cls, we find the following:

\newcommand{\prepartname}{第}
\newcommand{\postpartname}{部}
\newcommand{\prechaptername}{第}
\newcommand{\postchaptername}{章}
\newcommand{\contentsname}{目 次}
\newcommand{\listfigurename}{図 目 次}
\newcommand{\listtablename}{表 目 次}
\newcommand{\bibname}{関連図書}
\newcommand{\refname}{参考文献}
\newcommand{\indexname}{索 引}
\newcommand{\figurename}{図}
\newcommand{\tablename}{表}
\newcommand{\appendixname}{付 録}
\newcommand{\abstractname}{概 要}

We therefore need to redefine the \chapter and \part commands, to add text before and after the part or chapter number.

\def\@part[#1]#2{%
  ...
  \huge\bfseries \prepartname~\thepart~\postpartname
  ...
}

\def\@makechapterhead#1{%
  ...
  \prechaptername~\thechapter~\postchaptername
  ...
}

The \today command should be redefined to print the date in Japanese. The following code is adapted from jbook.cls (which also treats the case of vertical writing, in which case the numerals have to be written in kanji; I removed this part).

\newif\if西暦 \西暦false
\def\西暦{\西暦true}
\def\和暦{\西暦false}
\newcount\heisei \heisei\year \advance\heisei-1988\relax
\def\today{{%
  \if西暦
    \number\year~年%
    \number\month~月%
    \number\day~日%
  \else
    平成\ifnum\heisei=1 元年\else\number\heisei~年\fi
    \number\month~月%
    \number\day~日%
  \fi
}}
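
For instance, at the time of writing (20 June 2002, i.e., Heisei 14, since \heisei = \year - 1988 = 14), we get, up to the spaces inserted by the ties:

\和暦 \today   % 平成14年6月20日
\西暦 \today   % 2002年6月20日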

Non-latin characters in macro names

In the above example, we defined the macros \西暦 and \和暦 to select how the date should be typeset. But they do not work, because Omega does not consider these characters as letters: they cannot appear inside control sequences. To allow their use in macro names, we have to change their catcodes.

\catcode`\^^^^897f=11% 西
\catcode`\^^^^66a6=11% 暦
\def\西暦{...}

As there are several thousand catcodes to change, and as they should be set one by one, we shall write a small Perl script to do the job.

#!/usr/bin/perl -w
use strict;
foreach my $code (0x1100..0x11FF,
                  0x2E80..0xFFFF)
{
  printf '\catcode`\^^^^%04x=11'."\n", $code;
}
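
The script's output can be generated once and read back in the preamble (the names catcodes.pl and cjk-catcodes.tex are mine, choose your own):

perl catcodes.pl > cjk-catcodes.tex

and, in the document,

\input cjk-catcodes.tex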

Text width

We would like the characters to be vertically aligned, as much as possible. A possible solution would be to set the text width to an integer multiple of the character width. For instance

\newlength{\tmplength}
\settowidth{\tmplength}{字字字字字字字字字字字字字字字字字字字字字字字字字字字字字字}
\usepackage[a4paper,textwidth=\tmplength]{geometry}

The same goes for the paragraph indentation, which ought to be an integer multiple of the character width (unless you explicitly use the ideographic space for this purpose, in which case the indentation should be zero).

\settowidth{\parindent}{漢字}

Multicols

Things get more complicated with the multicol package.

First, the hyphenation parameters are reset by the multicols environment. But we just have to add

\multicolpretolerance=-1

in the preamble.

Second, the text width will be changed by the multicols package. I have not found a simple and clean way to set it, so here is a simple and dirty one: first use multicols without any particular setting, then count the number of characters in a line inside a column, and set \hsize locally. For instance, if I count 12 characters, I set it as follows:

\begin{multicols}{3}
\settowidth{\hsize}{字字字字字字字字字字字字}%
...
\end{multicols}

Vertical writing (not yet)

There are several problems:

First, typeset the text vertically. All the text should be vertical (RTR, in Omega parlance), except page headers and footers (we shall first assume that there are no floats or footnotes).

Second, there is a font problem: most characters remain unchanged, but some have to be turned (parentheses, quotes, long vowel mark) or moved (full stop, comma, small kana). There do exist vertical fonts, but I do not have one.

Third, I do not know if Omega is really able to do that (yet). The documentation for Omega 1.12 (the only documented version) says that it is not done yet; I am using version 1.15; the latest version is 1.23. It may be wiser to wait until things stabilize.

Vincent Zoonekynd
<zoonek@math.jussieu.fr>
latest modification on Thu Jun 20 10:06:36 CEST 2002