LINUX GAZETTE

"Linux Gazette...making Linux just a little more fun!"


Writing Documentation, Part II: LaTeX with latex2html

By Christoph Spiel


LaTeX

Let me first define what LaTeX is and what its primary goals are. LaTeX is a huge add-on macro package for the TeX typesetting system developed by Prof. Donald E. Knuth. If we are not overly picky, we mean ``TeX plus all LaTeX macros'' when we say ``LaTeX system'' or just ``LaTeX''. LaTeX itself was written by Leslie Lamport, who found TeX to be very powerful, but too difficult for everyday use. Therefore he modeled LaTeX after the Scribe system. Scribe puts its emphasis on the logical structure of a document instead of the physical markup. (For readers proficient in HTML: the tag <em> is an example of logical markup, and the tag <i> is the corresponding physical markup.)

LaTeX -- like plain TeX -- allows a normal computer user to typeset documents with production-ready quality. The intention is that a LaTeX author prepares articles or even books on her local computer, then walks over to the print shop with a diskette to have the document printed on a high-resolution phototypesetter, and finally has it bound as a book (... ships the book off to all bookstores in the alpha-quadrant, makes millions from it, and two years later wins the Intergalactic Pulitzer Prize. -- OK, this is a bit of a stretch).

In the next sections I will give a very brief introduction to LaTeX, but I would like to recommend the Not So Short Introduction to LaTeX to everyone who wants to learn LaTeX. The 95-page document is available for free on the Net. Please see ``Further Reading'' for details.

LaTeX gets installed by most current Linux distributions. You can check whether it is available on your machine by asking

    latex --version

at the command line. My system responds with

    TeX (Web2C 7.3.1) 3.14159
    kpathsea version 3.3.1
    Copyright (C) 1999 D.E. Knuth.
    Kpathsea is copyright (C) 1999 Free Software Foundation, Inc.
    There is NO warranty.  Redistribution of this software is
    covered by the terms of both the TeX copyright and
    the GNU General Public License.
    For more information about these matters, see the files
    named COPYING and the TeX source.
    Primary author of TeX: D.E. Knuth.
    Kpathsea written by Karl Berry and others.

Overall Document Structure

Here is an example of a very short, yet complete LaTeX document:

    \documentclass{article}
    % preamble
    \pagestyle{empty}
    \begin{document}
    % body
    Here comes the text.
    \end{document}

Every LaTeX document consists of a preamble and a body. The preamble reaches from the definition of the document's class, \documentclass[options]{class}, up to, but excluding \begin{document}. The body is everything from \begin{document} to \end{document}.

The preamble in the example features only one command, \pagestyle{empty}, which instructs LaTeX to omit all page decorations such as running heads or page numbers. The percent signs introduce comments that extend to the ends of the respective lines.
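
As a slightly larger sketch of the same structure, the following document passes options to the class and loads a package in the preamble. The options (a4paper, 11pt) and the package (makeidx) are arbitrary choices for illustration; any others would make the point equally well.

    \documentclass[a4paper,11pt]{article}   % options go in square brackets
    % preamble: global settings only, no text
    \usepackage{makeidx}                    % load an add-on package
    \pagestyle{empty}
    \begin{document}
    % body: everything that gets typeset
    Here comes the text.
    \end{document}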

Syntax

Paragraphs
Paragraphs are separated by one or more blank lines. The number of blank lines does not influence the output; one is as good as many. The same holds true for spaces (which separate words, but didn't you know that?): one hundred spaces produce the same output as a single space. Newlines, that is, line terminators, are counted as spaces, and so are tabulator characters.

If we apply these simple rules to the three different versions of two paragraphs that follow, we conclude that they all will be typeset the same. I have added line numbers at the beginning of each line to point out empty lines, which separate the paragraphs. The numbers are not part of the text.

Version A
    1    I am a short sentence in the first paragraph.
    2
    3    I'm the only sentence in the second paragraph.
Version B
    1    I am a short sentence
    2    in the first paragraph.
    3
    4    I'm the
    5    only sentence
    6    in the second
    7    paragraph.
Version C
    1    I   am   a   short    sentence   in   the  first paragraph.
    2
    3
    4    I'm the only sentence
    5        in the
    6            second paragraph.
Special Characters
Most non-alphanumeric characters carry a special meaning inside LaTeX. This is one of the features that appalls LaTeX beginners. However, after some time, the user becomes aware of their particular behavior.

I have collected the most important special characters along with ways to insert them literally into a text.

\
Introduce a command, like ``\dots'' or ``\/''.

Note that ``\\'' does not insert a single backslash character into the text as many C programmers might assume right now. The control sequence ``\\'' inserts a line break, whereas a literal backslash is produced by ``$\backslash$''. To maximize the confusion, ``\ '' (that is, a backslash followed by a blank space) is a command, too! It inserts a so-called control space, a space (more precisely: exactly one space) that is never eaten up like ordinary spaces as explained in section ``Paragraphs''.

{}
Group arguments together.

You get literal curly braces by quoting them with a backslash like this ``\{'' and ``\}''.

%
Start a comment that reaches to the end of the line.

Comments extend up to and include the newline character at the end of a line. Thus LaTeX comments differ from one-line comments in most general-purpose programming languages, as those exclude the newline character. For the user this means that a newline can be masked by ending a line with a comment.

    Hessenberg-%
    Triangular % <- note space directly in front of the %-sign
    Reduction

is equivalent to

    Hessenberg-Triangular Reduction

To typeset a literal percent sign, use ``\%''.

~
Make an unbreakable space, like ``&nbsp;'' in HTML.
$math$
Switch to math mode and back.

The sequence math is typeset inline in mathematical typesetting mode. To get a literal dollar sign, use ``\$''.
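
A minimal sketch of inline math inside running text, together with a literal dollar sign; the sentences are made up for the purpose:

    The theorem of Pythagoras states that $a^2 + b^2 = c^2$
    holds for every right triangle.  The proof costs \$5.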

The following table summarizes all ASCII characters that are treated specially by LaTeX. The rightmost column of the table suggests one or more possible equivalent sequences to get the plain ASCII character into the text. As can be guessed from the entries for caret and twiddle, \char followed by code_number inserts the ASCII character with the decimal index code_number into a document.

ASCII characters that are special for LaTeX. The right column denotes the strings (in LaTeX) which produce the ASCII characters in the middle column.

    Name                 ASCII   LaTeX
    sharp                #       \#
    dollar               $       \$
    percent              %       \%
    ampersand            &       \&
    multiplication sign  *       * or $*$
    minus sign           -       $-$
    less-than sign       <       $<$
    greater-than sign    >       $>$
    backslash            \       $\backslash$
    caret                ^       \char94
    underscore           _       \_
    curly braces         {, }    \{, \}
    vertical bar         |       $|$
    twiddle              ~       \char126

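
To see several rows of the table at work, here is a small sketch; the text is invented and only serves to exercise the escape sequences:

    Order \#3 costs \$5 \& ships at 10\% off.
    The file is called my\_file, the set is \{a, b\},
    and the condition reads $x < y$.
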
Commands
LaTeX commands usually start with a backslash character ``\'' and either extend from the backslash to the next non-letter character (kind 1) or consist of exactly one non-alphanumeric character (kind 2). So ``\raggedleft'' and ``\makebox'' are commands of kind 1 whereas ``\\'' and ``\"'' are commands of kind 2. Arguments are passed to commands within curly braces ``{'', ``}''. Empty arguments can be omitted.

Examples:

    \raggedleft{}                      % no argument
    \raggedleft                        % same as above
    \makebox{Text inside of a box.}    % single argument
    \parbox{160pt}{This text is
    typeset inside of a box.}          % two arguments

The number of arguments passed to a command is fixed. However, some commands accept optional parameters. These are passed within square brackets (``['', ``]'') and usually precede the arguments just as the options precede the arguments in most UN*X utility programs.

Example:

    \parbox[t]{10cm}{I am a top-aligned
    paragraph.} % one option, two arguments

Here t is the optional parameter.

Spaces that follow a kind 1 command name without arguments (like the second ``\raggedleft'' above) are ``eaten''; they are not passed on to the output.
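
The space-eating rule is easy to trip over. The following sketch uses the predefined command \LaTeX, which typesets the LaTeX logo and takes no arguments, to show the problem and two common remedies: an empty group and the control space.

    \LaTeX is fun.     % space is eaten: the logo runs into ``is''
    \LaTeX{} is fun.   % empty group ends the command name; the space survives
    \LaTeX\ is fun.    % control space inserts exactly one space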

Environments
Environments are pairs in the form

\begin{environment}

Text within the environment.

\end{environment}

An environment changes the appearance of the text within it. Environments control the alignment, the width of the margins and many other things. Some predefined environments are: center, description, enumerate, flushleft, flushright, itemize, list, minipage, quotation, quote, tabbing, table, tabular, verbatim, and verse.

Environments do nest. For example, to get a quotation typeset flush against the right margin, use the flushright environment and the quotation environment.

    \begin{flushright}
        \begin{quotation}
            Letters are things,     \\
            not pictures of things. \\
            -- Eric Gill
        \end{quotation}
    \end{flushright}

An environment only affects text inside of it; it encapsulates all changes, like a different indentation occurring within the environment. (Well -- unless you happen to change a global variable, but I won't tell you how to do that, so you're safe.)
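
A short sketch of this encapsulation, assuming nothing beyond the predefined center environment and the predefined declaration \itshape (switch to an italic font); the text is made up:

    This sentence is set in the normal font.
    \begin{center}
        \itshape This line is centered and set in italics.
    \end{center}
    Here the text is neither centered nor italic any more.

Both the centering and the font change end at \end{center}.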

Sectioning

LaTeX knows three or four heading levels depending on the document class. Class article has three section levels, whereas classes book and report feature chapters as a fourth and topmost heading level.

\chapter{heading} % only for class book and report

\section{heading}

\subsection{heading}

\subsubsection{heading}

Note that as in POD, discussed in Part I, sectioning commands act as separators. They do not group together text with a start marker and an end marker, but their mere appearance groups the text. This will be different in DocBook, as I shall show in next month's article.
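
The separator behavior can be sketched with a tiny article; the headings and sentences are placeholders:

    \documentclass{article}
    \begin{document}
    \section{Introduction}
    This text belongs to the introduction.
    \subsection{Motivation}
    This text belongs to the motivation.
    \section{Results}
    This text already belongs to the results, simply because
    the next sectioning command has appeared.
    \end{document}

There is no command that closes a section; the following sectioning command, or the end of the document, does so implicitly.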

Lists

LaTeX ships with three kinds of list-generating environments: itemize, enumerate, and description.

They correspond to unnumbered lists, numbered lists, and definition lists in HTML, or =item *, =item 1, =item term lists in POD.

The items themselves are introduced with ``\item''. An item can consist of more than one paragraph.

For description lists the optional parameter given to ``\item'' as in ``\item[term]'' specifies the term. The text following ``\item[term]'' is term's definition.

Examples:

Itemized List
    What emacs can do for you:
    \begin{itemize}
        \item Cut and paste blocks of text
        \item Fill or justify paragraphs
        \item Spell check documents
    \end{itemize}
Enumerated List
    Starting emacs for the first time
    \begin{enumerate}
        \item Start emacs from the command line:
        \texttt{\$ emacs}
        emacs will show you its startup screen and soon switch to a
        buffer called \texttt{*scratch*}.
        \item Hold down the Control~key and press~H.  You see a prompt
        at the bottom of the screen (or emacs window).
        \texttt{C-h (Type ? for further options)-}
        \item Press~T to start the emacs tutorial.
    \end{enumerate}
Description List
    Some emacs commands:
    \begin{description}
        \item[C-x C-c] Quit emacs.
        \item[C-x C-f] Open a file.
        \item[C-x r k]
            Kill rectangle defined by mark and point, that is, by the
            active region.
    \end{description}

Cross-References

All cross references need two parts: a pointer (the link) and a pointee (the anchor). Anchors in LaTeX are inserted with \label{anchor-name}. Every anchor is located in a particular section and on a particular page. These two pieces of information are retrieved with \ref{anchor-name} and \pageref{anchor-name} at any place in the document.

Example use of \ref:

    \section{Setup}\label{section:setup}
    ...
    \section{Summary}\label{section:summary}
    As has been pointed out in section~\ref{section:setup} `Setup', ...

Example use of \pageref:

    \section{Setup}\label{section:setup}
    The steel used in the sample chamber is alloyed with Ti (0.5\%),
    Cr (0.1\%), and Mn (0.1\%).\label{definition:chamber-alloy}
    \section{Experiments}\label{section:experiments}
    The sample chamber is made of stainless steel (see
    page~\pageref{definition:chamber-alloy} for the exact
    metallurgical composition), ...
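
Both kinds of reference can point at the same anchor. The sketch below reuses the label name section:setup from the examples above; the sentence is made up.

    \section{Setup}\label{section:setup}
    ...
    \section{Summary}
    The apparatus is described in section~\ref{section:setup}
    on page~\pageref{section:setup}.

Note that LaTeX collects the labels in an auxiliary file during one run and resolves the references in the next, so a document with cross-references has to be processed at least twice.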

Defining Your Own Commands and Environments

One of the major advantages of the LaTeX typesetting system is that it allows the user to define her own commands and environments. Say you want to mark up all replaceable parameters in the description of a UN*X utility, like in

    cd directory

to be rendered as, for example,

cd directory

Here, cd is the utility's name, and directory is the replaceable parameter.

Often utility names are typeset in bold face, and replaceable parameters in italics. Thus, a good solution would be to write

    \utilityname{cd} \replaceable{directory}

where \utilityname and \replaceable switch fonts to bold face and italics respectively. With the help of \utilityname and \replaceable we can consistently mark up further command lines:

    \utilityname{pushd} \replaceable{directory}
    \utilityname{ls} \replaceable{filename}

To define a new LaTeX command, use

\newcommand{command-name}[number-of-arguments]{command-sequence}

where command-name is the new command's name, number-of-arguments is the number of arguments the new command takes (it defaults to 0 if omitted), and command-sequence are the LaTeX commands to execute when command-name is called.

For our example, define \utilityname and \replaceable as:

    \newcommand{\utilityname}[1]{\textbf{#1}}
    \newcommand{\replaceable}[1]{\textit{#1}}

The predefined commands \textbf and \textit switch fonts to text bold face (as opposed to math bold face) and text italic. Arguments are referred to by #digit, where digit takes on values from 1 to 9.

To give you an impression of the usefulness of our newly defined commands, suppose we would like to generate an index entry for each utility that is mentioned in the text. Command \index{term} puts term in the index. We only need to modify the definition of \utilityname to

    \newcommand{\utilityname}[1]{\textbf{#1}\index{#1}}

and are done. (For the curious: index levels are separated with exclamation marks. So, we probably would prefer \index{utility!#1} as it neatly groups all utilities together. See the documentation of makeindex for details.)
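
For the index to actually show up, the document needs a little more scaffolding. The following sketch assumes the standard makeidx package and the makeindex program; the document text is made up.

    \documentclass{article}
    \usepackage{makeidx}
    \makeindex                  % write the raw index entries to the idx file
    \newcommand{\utilityname}[1]{\textbf{#1}\index{#1}}
    \begin{document}
    Use \utilityname{cd} to change the working directory.
    \printindex                 % insert the sorted index here
    \end{document}

Run latex, then makeindex on the generated idx file, then latex once more to pull the sorted index into the document.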

New environments are defined with

\newenvironment{environment-name}[number-of-arguments]{starting-sequence}{ending-sequence}

the only difference being that \newenvironment requires two command sequences: one to open the environment, starting-sequence, and one to close it, ending-sequence. Continuing the example of a quotation typeset flush right against the page's margin, we define our own quotation environment:

    \newenvironment{myquotation}% Note: "%" masks newline
    {\begin{flushright}\begin{quotation}}%
    {\end{quotation}\end{flushright}}

which is then used like this:

    \begin{myquotation}
        Letters are things,     \\
        not pictures of things. \\
        -- Eric Gill
    \end{myquotation}

Neither commands nor environments can be defined multiple times with \newcommand or \newenvironment. These commands only serve first-time definitions. Redefinitions are done with \renewcommand and \renewenvironment, which take the same arguments as their first-time cousins.
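
A minimal sketch of a redefinition, reusing \utilityname from above; the switch to typewriter font is an arbitrary choice for illustration:

    \newcommand{\utilityname}[1]{\textbf{#1}}     % first-time definition
    \renewcommand{\utilityname}[1]{\texttt{#1}}   % redefinition: now typewriter
    % A second \newcommand{\utilityname}... at this point
    % would stop LaTeX with an error message.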

Inline Markup

LaTeX offers an extremely rich set of inline markup. I restrict the discussion to the same three inline markup changes which I discussed for Perl's plain old documentation format: emphasis or italics, bold face, and typewriter (code) font.

Emphasis and Italics
\textit{argument} -- Typeset argument in text italics.

\emph{argument} -- Emphasize argument. The default configuration switches to and from italics depending on the current font setting. If the current font is upright, \emph uses italics; if the current font is italics, it uses an upright font. This way the emphasized parts of text always stand out.

Why have \textit and at the same time \emph? The commands express different requests. \textit unconditionally demands the argument to be typeset using an italics font. Period. \emph on the other hand asks for emphasizing its argument, whatever the emphasis may look like. The default uses an italics font as explained above, but \emph can be redefined to use a bold font, underlining, or anything else the writer imagines for her preferred method of emphasizing. The command name emph always captures the concept of emphasis and hides the implementation.
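
The toggling behavior of \emph can be sketched in two lines; the sentences are made up:

    Upright text with an \emph{emphasized} word.          % set in italics
    \textit{Italic text with an \emph{emphasized} word.}  % set upright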

Bold Face
\textbf{argument} -- Typeset argument in text bold face.

Based on \textbf, we can define our own logical markup commands, for example

    \newcommand{\important}[1]{\textbf{#1}}
Typewriter Font
\texttt{argument} -- Typeset argument in text typewriter font.

As with \textbf, \texttt can be wrapped into user-defined commands:

    \newcommand{\sourcecode}[1]{\texttt{#1}}

LaTeX Tool Chain

LaTeX files usually carry the extension tex. LaTeX translates these tex files into so-called device independent (dvi) files. dvi files are a binary representation of the typeset document. They can be previewed with dvisvga on the console (given the terminal supports high-resolution graphics), or, for example, with xdvi under the X11 windowing system. Often dvi files are converted to PostScript with the dvips tool. If Portable Document Format is desired, pdflatex transforms tex files into pdf files in a single step.

latex2html

So far so good. LaTeX makes wonderful-looking PostScript documents, and its pdf sibling does the same, but outputs Portable Document Format files. Didn't we say we want HTML, too? Sure we did! But LaTeX cannot help us here; we need another tool: latex2html. This tool transforms a LaTeX source file into a set of html files that are properly linked together according to the source file's structure.

latex2html has a home page at http://www.latex2html.org where it is available for download. It can also be obtained from http://www.ctan.org or, better, from one of its many mirrors. To see whether it is installed on your Linux system, try

    latex2html --version

and you should get an answer like

    This is LaTeX2HTML Version 2K.1beta (1.57)
    by Nikos Drakos, Computer Based Learning Unit, University of Leeds.

What do I have to change to make my LaTeX document translatable with latex2html? -- Good news: almost nothing! Just ensure that the packages html and makeidx are referenced in the document's preamble, that is, at least add

    \usepackage{html,makeidx}

to it. Now file my_document.tex can be translated to HTML with the call

    latex2html my_document.tex
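
For completeness, here is a sketch of what a minimal my_document.tex could look like. The section title and the label are placeholders; the html package is the style file that comes with latex2html.

    \documentclass{article}
    \usepackage{html,makeidx}   % html.sty ships with latex2html
    \begin{document}
    \section{Setup}\label{section:setup}
    Here comes the text.
    \end{document}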

References Revisited

latex2html takes care of almost all issues that arise when a LaTeX file is translated into a set of html files. However, references to other parts in the document or other documents are conceptually different in printed documentation and HTML. Consider the LaTeX snippet

    In the following, we will summarize the findings
    using a cylindrical coordinate system.  See
    page~\pageref{definition:coordinate-system}
    for the definition of the coordinate system.

where LaTeX dutifully replaces \pageref{definition:coordinate-system} with the number of the page on which \label{definition:coordinate-system}, the anchor of the page reference, occurs. Where is the problem? First, a set of html pages does not have a rigid notion of a ``page number''. Second, latex2html does replace \pageref{definition:coordinate-system} with a hyperlink to the spot where \label{definition:coordinate-system} is rendered. The link is a dark square in graphical browsers or the marker ``[*]'' in text browsers. But the whole construct looks awkward, almost distracting, and this is not latex2html's fault:

In the following, we will summarize the findings using a cylindrical coordinate system. See page  [*] for the definition of the coordinate system.

latex2html needs our help! The paragraph that contains the reference ought to be rephrased for the on-screen version, for example to:

    In the following, we will summarize the
    findings using a <a>cylindrical coordinate
    system</a>.

where I have indicated the hyperlink with HTML anchor tags. To allow for two different versions depending on the output format, latex2html defines the \hyperref command.

\hyperref[reference-type]{text for html version}{pre-reference text for LaTeX version}{post-reference text for LaTeX version}{label}

The optional parameter reference-type selects the counter the reference refers to:

``ref''
Cross reference to a section number like \ref does. The reference text is the section number (``4'', ``1.5.2'', ``3.4.2.1'', etc.).
``page'' or ``pageref''
Reference to a page number like \pageref does. The reference text is a page number (``25'', ``xxiii'', etc.).

Rewritten with \hyperref, our example looks like this:

    In the following, we will summarize the
    findings using a \hyperref[pageref]%
    {cylindrical coordinate system}% for HTML
    {cylindrical coordinate system.  See page~}% for LaTeX
    { for the definition of the coordinate system}% trailing text for LaTeX
    {definition:coordinate-system}.% label the reference refers to

LaTeX renders it to

In the following, we will summarize the findings using a cylindrical coordinate system. See page 97 for the definition of the coordinate system.

and latex2html produces

In the following, we will summarize the findings using a cylindrical coordinate system.

from it.
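
The ``ref'' variant works along the same lines. The following is a sketch under the assumption that a \label{section:setup} exists somewhere in the document; the wording is made up.

    The apparatus is described in \hyperref[ref]%
    {the setup section}%   for HTML: becomes the link text
    {section~}%            for LaTeX: text before the section number
    {}%                    for LaTeX: text after the section number (none)
    {section:setup}.%      label the reference refers to

If I read the description of \hyperref correctly, the printed version then says ``described in section'' followed by the section number, while the HTML version turns ``the setup section'' into a hyperlink.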

Hyperlinks

A problem related to the one we have just encountered with references arises when hyperlinks come into play. In the HTML version of the document hyperlinks are essential; in the printed version, they are of little use: compare ``Click here'' with ``Press your pencil against this letter''. Sometimes, however, the author really wants to include the target of the hyperlink, a universal resource locator (URL), in the printed text. latex2html defines two commands that cater to exactly these needs.

\htmladdnormallink{link text}{universal resource locator}

\htmladdnormallinkfoot{link text}{universal resource locator}

Both commands generate the hyperlink <a href = "universal resource locator">link text</a> in the HTML version. The first only renders link text in the LaTeX version, suppressing universal resource locator completely. The second adds a footnote containing universal resource locator. The typical usage of these commands is

    The text of this article can be downloaded from our
    \htmladdnormallink{web site}{http://www.linux-gazette.org}.

and

    The text of this article can be downloaded from our
    \htmladdnormallinkfoot{web site}{http://www.linux-gazette.org}.

where the LaTeX result of the first looks like this

The text of this article can be downloaded from our web site.

For the second, ``web site'' gets a footnote marker, and a footnote with the URL is placed at the bottom of the page. The HTML output will show up both times as

The text of this article can be downloaded from our web site.

Format Specific Commands

As a last resort, several commands and environments enable the writer to divert her text between the LaTeX and HTML versions of the document.

I recommend using diversion of output only if no more specialized latex2html command or environment can produce the desired markup, for splitting always requires keeping both branches in sync.
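
As a sketch of such a diversion, the following snippet uses the latexonly and htmlonly environments that the html package provides. The label name is taken from the earlier example; both sentences are made up.

    \begin{latexonly}
    See page~\pageref{definition:coordinate-system} for the definition.
    \end{latexonly}
    \begin{htmlonly}
    The definition is given elsewhere in this document.
    \end{htmlonly}

The first branch only shows up in the printed version, the second only in the HTML version. For short pieces of text, the command \latexhtml{text for LaTeX}{text for HTML} serves the same purpose. Either way, the two branches have to be maintained by hand.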

latex2html Pros and Cons

Pros
Cons

Further Reading

Next month: DocBook

Christoph Spiel

Chris runs an Open Source Software consulting company in Upper Bavaria, Germany. Despite being trained as a physicist -- he holds a PhD in physics from Munich University of Technology -- his main interests revolve around numerics, heterogeneous programming environments, and software engineering. He can be reached at cspiel@hammersmith-consulting.com.


Copyright © 2002, Christoph Spiel.
Copying license http://www.linuxgazette.net/copying.html
Published in Issue 74 of Linux Gazette, January 2002
