GNU Source-highlight, given a source file, produces a document with syntax highlighting.
This is Edition 3.1 of the Source-highlight manual.
This file documents GNU Source-highlight version 3.1.
This manual is for GNU Source-highlight (version 3.1, 10 June 2009), which, given a source file, produces a document with syntax highlighting.
Copyright © 2005-2008 Lorenzo Bettini, http://www.lorenzobettini.it.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, with the Front-Cover Texts being “A GNU Manual,” and with the Back-Cover Texts as in (a) below. A copy of the license is included in the section entitled “GNU Free Documentation License.”
(a) The FSF's Back-Cover Text is: “You have freedom to copy and modify this GNU Manual, like GNU software. Copies published by the Free Software Foundation raise funds for GNU development.”
GNU Source-highlight, given a source file, produces a document with syntax highlighting. The colors and the styles can be specified (bold, italics, underline) by means of a configuration file, and some other options can be specified at the command line.
The program already recognizes many programming languages (e.g., C++, Java, Perl, etc.) and file formats (e.g., log files, ChangeLog, etc.), and some output formats (e.g., HTML, ANSI color escape sequences, LaTeX, etc.). Since version 2.0, it allows you to specify your own input source language via a simple syntax described later in this manual (Language Definitions). Since version 2.1, it allows you to specify your own output format language via a simple syntax described later in this manual (Output Language Definitions). Since version 2.2, it is able to generate cross references (e.g., to variable names, field names, etc.) by relying on the program ctags, http://ctags.sourceforge.net (Generating References).
Since version 3.0, GNU Source-highlight also provides a C++ library (used by the main program itself) that C++ programmers can use to add highlighting functionality to their programs (see Introduction).
The complete list of languages (indeed, file extensions) natively
supported by this version of Source-highlight (3.1), as
reported by --lang-list, is the following:
     C = cpp.lang
     H = cpp.lang
     ac = m4.lang
     ada = ada.lang
     adb = ada.lang
     am = makefile.lang
     autoconf = m4.lang
     bib = bib.lang
     bison = bison.lang
     c = c.lang
     caml = caml.lang
     cc = cpp.lang
     changelog = changelog.lang
     cls = latex.lang
     conf = conf.lang
     cpp = cpp.lang
     cs = csharp.lang
     csh = sh.lang
     csharp = csharp.lang
     css = css.lang
     desktop = desktop.lang
     diff = diff.lang
     docbook = xml.lang
     dtx = latex.lang
     eps = postscript.lang
     fixed-fortran = fixed-fortran.lang
     flex = flex.lang
     fortran = fortran.lang
     free-fortran = fortran.lang
     glsl = glsl.lang
     h = cpp.lang
     haxe = haxe.lang
     hh = cpp.lang
     hpp = cpp.lang
     htm = html.lang
     html = html.lang
     hx = haxe.lang
     in = makefile.lang
     ini = desktop.lang
     java = java.lang
     javascript = javascript.lang
     js = javascript.lang
     kcfg = xml.lang
     kdevelop = xml.lang
     kidl = xml.lang
     ksh = sh.lang
     l = flex.lang
     lang = langdef.lang
     langdef = langdef.lang
     latex = latex.lang
     ldap = ldap.lang
     ldif = ldap.lang
     lex = flex.lang
     lgt = logtalk.lang
     ll = flex.lang
     log = log.lang
     logtalk = logtalk.lang
     lsm = lsm.lang
     lua = lua.lang
     m4 = m4.lang
     makefile = makefile.lang
     ml = caml.lang
     mli = caml.lang
     moc = cpp.lang
     outlang = outlang.lang
     oz = oz.lang
     pas = pascal.lang
     pascal = pascal.lang
     patch = diff.lang
     pc = pc.lang
     perl = perl.lang
     php = php.lang
     php3 = php.lang
     php4 = php.lang
     php5 = php.lang
     pkgconfig = pc.lang
     pl = prolog.lang
     pm = perl.lang
     postscript = postscript.lang
     prolog = prolog.lang
     properties = properties.lang
     ps = postscript.lang
     py = python.lang
     python = python.lang
     rb = ruby.lang
     rc = xml.lang
     ruby = ruby.lang
     scala = scala.lang
     sh = sh.lang
     shell = sh.lang
     sig = sml.lang
     sl = slang.lang
     slang = slang.lang
     slsh = slang.lang
     sml = sml.lang
     spec = spec.lang
     sql = sql.lang
     sty = latex.lang
     style = style.lang
     syslog = log.lang
     tcl = tcl.lang
     tcsh = sh.lang
     tex = latex.lang
     tk = tcl.lang
     txt = nohilite.lang
     ui = xml.lang
     xhtml = xml.lang
     xml = xml.lang
     xorg = xorg.lang
     y = bison.lang
     yacc = bison.lang
     yy = bison.lang
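The key = file pairs above are easy to consume from a script. The following is a minimal Python sketch, not part of Source-highlight, that parses text in this format into a dictionary; the function name parse_lang_map and the sample text are assumptions made for illustration.

```python
def parse_lang_map(text):
    """Parse 'key = file' lines (the format shown above) into a dict."""
    mapping = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # skip blank lines and anything without '='
        mapping[key.strip()] = value.strip()
    return mapping


# A few entries copied from the list above (a sample, not the full output).
sample = """
     C = cpp.lang
     pl = prolog.lang
     pm = perl.lang
     txt = nohilite.lang
"""

lang_map = parse_lang_map(sample)
print(lang_map["pl"])  # the pl extension is associated with prolog.lang
```

The same function would also work on the output format list, since it uses the same key = file layout.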
The complete list of output formats natively supported by this version
of Source-highlight (3.1), as reported by
--outlang-list, is the following:
     docbook = docbook.outlang
     esc = esc.outlang
     html = html.outlang
     html-css = htmlcss.outlang
     htmltable = htmltable.outlang
     javadoc = javadoc.outlang
     latex = latex.outlang
     latexcolor = latexcolor.outlang
     texinfo = texinfo.outlang
     xhtml = xhtml.outlang
     xhtml-css = xhtmlcss.outlang
     xhtmltable = xhtmltable.outlang
The meaning of the suffix -css is explained in Output Language map.
   
Please keep in mind that I haven't personally tested all these
language definitions: I checked that the definition files are
syntactically correct (with the command line options --check-lang
and --check-outlang, see Invoking source-highlight), but I'm
not sure that each definition actually respects the syntax of its language
(e.g., I've put together some language definitions by searching for
information on the Internet, without ever having programmed in that
language).  So, if you find that a language definition is not precise,
please let me know.  Moreover, if you have a program example in a
language that's not included in the tests directory, please send
it to me so that I can include it in the test suite.
Since version 3.0, GNU Source-highlight also includes the program
source-highlight-settings, which can be used to check whether
source-highlight will be able to find its language definition files and
other configuration files and, if needed, to store the correct settings in
a configuration file in the user's home directory.
   
In particular, the stored configuration file will be called source-highlight.conf and stored in $HOME/.source-highlight/.
For the moment, this file only stores the default value for
the --data-dir option.
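As a small illustration, the location of that file can be computed as follows. This is only a sketch: the helper name settings_file and the example home directory are assumptions, and nothing is assumed here about the contents of the file.

```python
from pathlib import PurePosixPath


def settings_file(home):
    """Return the path $HOME/.source-highlight/source-highlight.conf."""
    return PurePosixPath(home) / ".source-highlight" / "source-highlight.conf"


# "/home/alice" is a made-up home directory used only for the example.
print(settings_file("/home/alice"))
```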
In this section I'd like to go into details on the highlighting of some specific programming languages. These notes might be useful when the highlighted language has some “dialects” that might require some further specification at the command line (e.g., to select a specific dialect).
As Toby White explained to me, Fortran comes in different “flavors”:
a fixed-format, where some characters have different semantics
depending on their column position in the source file, and a free-format,
where this is not true.  For instance, in the former, * and
c start a comment line, but only if they appear in the
first column (while this is not true in the free-format).
   
By default, the free-format is assumed for Fortran files; if you want to
use the fixed-format, you need to specify fixed-fortran with
the --src-lang command line option.
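The column rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here (a * or a c in the first column), not the contents of the fixed-fortran language definition; the function name is an assumption.

```python
def is_fixed_format_comment(line):
    """True if the line has '*' or 'c' in its first column."""
    return line[:1] in ("*", "c")


print(is_fixed_format_comment("c     this whole line is a comment"))
print(is_fixed_format_comment("      call compute(x)"))
```

The second line is not a comment because its c is not in the first column.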
Perl syntax forms, especially its regular expression specifications, are quite a nightmare ;-) I tried to specify as much as possible in the perl.lang but some particular regular expressions might not be highlighted correctly. Actually, I never programmed in Perl, so, if you see that some parts of your Perl programs are not highlighted correctly, please do not hesitate to contact me, so that I can improve Perl highlighting.
Moreover, although the standard extension for Perl files is .pl,
since the Prolog language definition was implemented in source-highlight
before Perl, this extension is assigned, by default, to Prolog files. 
However, you can use the --infer-lang command line option, so that
source-highlight can try to detect the language by inspecting the first
lines of the input file (see How the input language is discovered);
you can also use the --src-lang=perl command line specification to
explicitly require Perl highlighting.
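The effect of the default association and of an explicit language can be pictured with the following Python sketch. It only models the lookup by extension and the explicit override; it does not model --infer-lang, and the names LANG_MAP and pick_lang_file are assumptions made for this example.

```python
import os

# Excerpt of the associations listed in Supported languages.
LANG_MAP = {
    "pl": "prolog.lang",
    "pm": "perl.lang",
    "perl": "perl.lang",
    "prolog": "prolog.lang",
}


def pick_lang_file(filename, src_lang=None):
    """Return the .lang file for filename, honoring an explicit src_lang."""
    if src_lang is not None:
        key = src_lang
    else:
        key = os.path.splitext(filename)[1].lstrip(".")
    return LANG_MAP.get(key)


print(pick_lang_file("script.pl"))                   # default association
print(pick_lang_file("script.pl", src_lang="perl"))  # explicit override
```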
You can also use source-highlight as a simple formatter of input files, i.e., without performing any highlighting.
You can achieve this by using the file nohilite.lang as the language
definition file for input sources, via the command line option
--lang-def (see Invoking source-highlight).  Since that
language definition is empty, no highlighting will be performed;
however, source-highlight will transform the input file into the output
format.  Note, in the input language associations in Supported languages, that nohilite.lang is also associated with txt files.
   
This, for instance, makes source-highlight useful when you want to transform a text file into HTML or LaTeX. During the output, in fact, source-highlight will correctly generate characters that have a specific meaning in the output format.
For instance, in this Texinfo manual,
if I want to insert a @ or a {
I have to “escape” them to make them appear literally
since they have a special meaning in Texinfo. 
The same holds, e.g.,
for <, > or & in HTML. 
If you use source-highlight,
it will take care of this, automatically for you. 
This is the Texinfo source of the above sentence:
     For instance, in this Texinfo manual,
     if I want to insert a @@ or a @{
     I have to ``escape'' them to make them appear literally
     since they have a special meaning in Texinfo.
     The same holds, e.g.,
     for @code{<}, @code{>} or @code{&} in HTML.
     If you use source-highlight,
     it will take care of this, automatically for you.
This was processed by source-highlight as a simple text file, without any highlighting; however, since it was formatted in Texinfo, all the necessary escaping was automatically performed. This way, it is very easy to insert, in the same document, a piece of code and its result (as in this example).
This is actually the formatting performed by source-highlight; except for the comment, this is basically what you should have written yourself to do all the escaping stuff manually:
     @c Generator: GNU source-highlight, by Lorenzo Bettini, http://www.gnu.org/software/src-highlite
     @example
     For instance, in this Texinfo manual,
     if I want to insert a @@@@ or a @@@{
     I have to ``escape'' them to make them appear literally
     since they have a special meaning in Texinfo.
     The same holds, e.g.,
     for @@code@{<@}, @@code@{>@} or @@code@{&@} in HTML.
     If you use source-highlight,
     it will take care of this, automatically for you.
     @end example
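The escaping shown above follows simple substitution rules. Here is a minimal Python sketch of them, written for illustration only (it is not how source-highlight implements its output formats, and the function names are assumptions): in Texinfo, @, { and } are each prefixed with @; in HTML, <, > and & become entities.

```python
import html


def escape_texinfo(text):
    """Escape the characters that are special in Texinfo."""
    # '@' is doubled first, so the '@' added in front of braces stays single.
    return text.replace("@", "@@").replace("{", "@{").replace("}", "@}")


def escape_html(text):
    """Escape <, > and & for HTML using the standard library."""
    return html.escape(text, quote=False)


print(escape_texinfo("for @code{<}, @code{>} or @code{&} in HTML."))
print(escape_html("for <, > or & in HTML."))
```

The first call reproduces the corresponding line of the generated Texinfo output shown above.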
In case source-highlight does not handle a specific input language, you
can still use the option --failsafe (see Invoking source-highlight); in that case, too, no highlighting will be
performed, but source-highlight will transform the input file into the
output format.
   
Note, however, that if the input language cannot be established, the default.lang will be used: an empty language definition file which you might want to customize.
Here we list some software related to source-highlight in the sense that it uses it as a backend (i.e., provides an interface to source-highlight) or it uses some of its features (e.g., definition files):
http://nilrogsplace.se/webdesign/rapidweaver/plugins/high-light/index_en.html
See the file INSTALL for detailed building and installation instructions; anyway if you're used to compiling Linux software that comes with sources you may simply follow the usual procedure, i.e., untar the file you downloaded in a directory and then:
     cd <source code main directory>
     ./configure
     make
     make install
However, before you do this, please check that you have everything that is needed to build source-highlight (see What you need to build source-highlight).
Note: unless you specify a different install directory with the
--prefix option of
configure (e.g., ./configure --prefix=<your home>),
you must be root to run make install.
   
You may want to run ./configure --help to see all the possible
options that can be passed to the configuration script.
   
Files will be installed in the following directories:
     Executables                  prefix/bin
     docs and output examples     prefix/share/doc/source-highlight
     library examples             prefix/share/doc/source-highlight/examples
     library API documentation    prefix/share/doc/source-highlight/api
     conf files                   prefix/share/source-highlight
Default value for prefix is /usr/local
but you may change it with --prefix
option to configure.  For further configure options, you
can run configure --help.
   
Tiziano Muller wrote a bash completion configuration file for
source-highlight; this will be installed by default in the directory
sysconfdir/bash_completion.d, where sysconfdir defaults to
prefix/etc; however, typically, the directory where the bash
completion script searches for configuration file is
/etc/bash_completion.d.  Thus, we suggest you explicitly specify
this directory with the configuration script command line option
--with-bash-completion.
   
If you want to build and install the API documentation of
Source-highlight library, you need to run configure with the
option --with-doxygen, but you need the program Doxygen,
http://www.doxygen.org, to build the documentation. 
The documentation will be installed in the following directory:
     
     Library API documentation    prefix/share/doc/source-highlight/api
NOTE: Originally, instead of Source-highlight, there were two separate programs, namely GNU java2html and GNU cpp2html. There are two shell scripts with the same name that will be installed together with Source-highlight in order to facilitate the migration (however their use is not advised and it is deprecated).
You can download it from GNU's ftp site: ftp://ftp.gnu.org/gnu/src-highlite or from one of its mirrors (see http://www.gnu.org/prep/ftp.html).
I do not distribute Windows binaries anymore, since they can be built by using the Cygnus C/C++ compiler, available at http://www.cygwin.com. However, if you don't feel like downloading that compiler, or you experience problems with the Boost Regex library (see also Tips on installing Boost Regex library; please also keep in mind that if you don't have these libraries installed, and your C/C++ compiler distribution does not provide a prebuilt package, it might take some time, even hours, to build the Boost libraries from sources), you can request such binaries directly from me by e-mail (find my e-mail at my home page) and I'll be happy to send them to you. An MS-Windows port of Source-highlight is available from http://gnuwin32.sourceforge.net; however, I don't maintain those binaries personally, and they might be out of date.
Archives are digitally signed by me (Lorenzo Bettini) with GNU gpg (http://www.gnupg.org). My GPG public key can be found at my home page (http://www.lorenzobettini.it).
You can also get the patches, if they are available for a particular release (see below for patching from a previous version).
This project's CVS repository can be checked out through anonymous (pserver) CVS with the following instruction:
cvs -z3 -d:pserver:anonymous@cvs.savannah.gnu.org:/sources/src-highlite co src-highlite
Further instructions can be found at the address:
http://savannah.gnu.org/projects/src-highlite.
Please note that this way you will get the latest development sources of Source-highlight, which may also be unstable. This solution is the best if you intend to correct/extend this program: you should send me patches against the latest cvs repository sources.
If, on the contrary, you want to get the sources of a given release,
through cvs, say, e.g., version X.Y.Z, you must specify the tag
rel_X_Y_Z when you run the cvs command or the cvs update
command.
   
NOTE: This convention holds since release 2.1.
When you compile the sources that you get through the cvs repository,
before running the configure and make commands, you
should, at least the first time, run the command:
   
sh autogen.sh
This will run the autotools commands in the correct order, and also copy
possibly missing files.  You should have installed recent versions of
automake, autoconf and libtool in order for this to
succeed.  You will also need flex and bison.
   
Instead of running autogen.sh another option is to run
autoreconf -i
Since version 2.0 Source-highlight relies on regular expressions as provided by boost (http://www.boost.org), so you need to install at least the regex library from boost.
Most GNU/Linux distributions provide this library already in a compiled form. If you use your distribution packages, please be sure to install also the development package of the boost libraries.
If you experience problems in installing Boost Regex library, or in compiling source-highlight because of this library, please take a look at Tips on installing Boost Regex library.
If you want to use a specific version of the Boost regex library
(because you have many versions of it), you can use the configure option
--with-boost-regex to specify a particular suffix.  For instance,
./configure --with-boost-regex=boost_regex-gcc-1_31
Source-highlight has been developed under GNU/Linux, using gcc (C++), and bison (yacc) and flex (lex), and ported under Win32 with the Cygnus C/C++ compiler, available at http://www.cygwin.com.
I use the excellent
GNU Autoconf,
GNU Automake and
GNU Libtool. 
Since version 2.6 I have also been using Gnulib - The GNU Portability
Library, “a central
location for common GNU code, intended to be shared among GNU packages”
(for instance, I rely on Gnulib for checking for the presence and
correctness of the getopt_long function).
   
Finally I used GNU gengetopt (http://www.gnu.org/software/gengetopt), for command line parsing.
I started to use also doublecpp (http://doublecpp.sourceforge.net) that permits achieving dynamic overloading.
Actually, apart from the boost regex library, you don't need the other tools above to build source-highlight (indeed I provide the output sources generated by the above mentioned tools), unless you want to develop source-highlight.
However, if you obtained sources through CVS, you need some other tools, see Anonymous CVS Access.
If you experience no problem in compiling source-highlight, you can happily skip this section :-)
I created this section because many users reported problems after installing the Boost Regex library from sources; other users had problems in compiling source-highlight even if this library was already correctly installed (especially Windows users, using Cygwin). I hope this section sheds some light on installing and using the Boost Regex library. Please note that this section does not explain how to compile the Boost libraries (the documentation you'll find on http://www.boost.org is well done); it explains how to tweak things if you have problems in compiling source-highlight even after a successful installation of the Boost libraries.
First of all, if your distribution provides packages for the Boost regex
library, please be sure to install also the development package of the
boost libraries, i.e., those providing also the header files needed to
compile a program using these libraries.  For instance, on my Debian
system I had to install the package libboost-regex-dev, besides
the package libboost-regex.
   
If your distribution does not provide these packages then you have to download the sources of Boost libraries from http://www.boost.org and follow the instructions for compilation and installation. However, I suggest you specify /usr as prefix for installation, instead of relying on the default prefix /usr/local (unless /usr/local/include is already in the inclusion path of your C++ compiler), since this will make things easier when compiling source-highlight. I suggest this, since /usr/include is usually the place where C++ searches for header files during compilation.
If you successfully compiled and installed the Boost Regex library, or you installed the package from your distribution, but you STILL experience problems in compiling source-highlight, then you simply have to adjust some things as described in the following.
If the ./configure command of source-highlight reports this
error:
ERROR! Boost::regex library not installed.
then, the compiler cannot find the header files for this library. In this case, check that the directory /usr/include/boost actually exists; if it does not, then probably you'll find a similar directory, e.g., /usr/include/boost-1_33/boost, depending on the version of the library you have installed. Then, all you have to do is to create a symbolic link as follows:
ln -s /usr/include/boost-1_33/boost /usr/include/boost
Alternatively, you might run source-highlight's configure as follows:
./configure CXXFLAGS=-I/usr/include/boost-1_33/
If the ./configure command of source-highlight then reports this
other error:
     ERROR! Boost::regex library is installed, but you
     must specify the suffix with --with-boost-regex at configure
     for instance, --with-boost-regex=boost_regex-gcc-1_31
then there's still another thing to fix: you must find out the exact names of the files of your installed Boost Regex libraries; you can do this by using the command:
$ ls -l /usr/lib/libboost_regex*
that, for instance, on one of my Cygwin installations reports:
     -rwxr-x---+ Nov  9 23:29 /usr/lib/libboost_regex-gcc-mt-s-1_33.a
     -rwxr-x---+ Nov 22 09:22 /usr/lib/libboost_regex-gcc-mt-s.a
     -rwxr-x---+ Nov  9 23:29 /usr/lib/libboost_regex-gcc-mt-s-1_33.so
     -rwxr-x---+ Nov 22 09:22 /usr/lib/libboost_regex-gcc-mt-s.so
Now you have all the information to correctly run source-highlight's configure command:
./configure --with-boost-regex=boost_regex-gcc-mt-s-1_33
or, if you solved the first problem in the second way,
     ./configure CXXFLAGS=-I/usr/include/boost-1_33/ \
                 --with-boost-regex=boost_regex-gcc-mt-s-1_33
Of course, you have to modify this command according to the names of your Boost Regex library installed files.
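The value passed to --with-boost-regex in the example above is the library file name without the leading lib and without the extension. A minimal Python sketch of that derivation follows; the helper name boost_regex_suffix is an assumption, and the sketch only handles simple names like the ones listed above (a single extension such as .a or .so).

```python
import os


def boost_regex_suffix(library_path):
    """Turn /usr/lib/libNAME.a (or .so) into NAME."""
    name = os.path.basename(library_path)
    stem, _extension = os.path.splitext(name)
    if stem.startswith("lib"):
        stem = stem[len("lib"):]
    return stem


suffix = boost_regex_suffix("/usr/lib/libboost_regex-gcc-mt-s-1_33.a")
print("./configure --with-boost-regex=" + suffix)
```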
These instructions have allowed many users who were experiencing problems to compile source-highlight. If you still have problems, please send me an e-mail.
If you downloaded a patch, say source-highlight-1.3-1.3.1.patch.gz (i.e., the patch to go from version 1.3 to version 1.3.1), cd to the directory with sources from the previous version (source-highlight-1.3) and type:
gunzip -cd ../source-highlight-1.3-1.3.1.patch.gz | patch -p1
and restart the compilation process (if you had already run configure a simple make should do).
This was suggested by Konstantine Serebriany. The script src-hilite-lesspipe.sh will be installed together with source-highlight. You can use the following environment variables:
     export LESSOPEN="| /path/to/src-hilite-lesspipe.sh %s"
     export LESS=' -R '
This way, when you use less to browse a file, if it is a source file handled by source-highlight, it will be automatically highlighted.
CGI support was enabled thanks to Robert Wetzel; I haven't tested it personally. If you want to use source-highlight as a CGI program, you have to use the executable source-highlight-cgi. You can build such executable by issuing
make source-highlight-cgi
in the src directory.
Christian W. Zuckschwerdt added support for building an .rpm and an .rpm.src. You can issue the following command
rpmbuild -tb source-highlight-3.1.tar.gz
for building an .rpm with binaries and
rpmbuild -ts source-highlight-3.1.tar.gz
for building an .rpm.src with sources.
GNU Source-highlight is free software; you are free to use, share and modify it under the terms of the GNU General Public License that accompanies this software (see COPYING).
GNU source-highlight was written and maintained by Lorenzo Bettini http://www.lorenzobettini.it.
Here are some realistic examples of running source-highlight.
Source-highlight only does a lexical analysis of the source code, so the program source is assumed to be correct!
Here's how to run source-highlight (for this example we will use C/C++ input files, but this is valid also for other source-highlight input languages):
     source-highlight --src-lang cpp --out-format html \
         --input <C++ file> \
         --output <html file> \
         --style-file <style file> \
         options
For input files, apart from the -i (--input) option and the
standard input redirection, you can simply specify some files at the
command line and also use shell wildcards (for instance
*.java).  In this case the name of each output file will be
formed using the name of the source file with a .<ext> appended, where
<ext> is the extension chosen according to the output format specified
(in this example it would be .html).  The style file
(see Output format style)
contains information on how to format specific language parts
(e.g., keywords in blue and boldface, etc.).
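The naming rule for output files described above can be written as a one-line Python helper. This is only a sketch of the rule as stated (source name plus a dot plus the extension); the function name is an assumption, and the extension is passed in explicitly since only the .html case is given here.

```python
def output_file_name(source_file, extension):
    """Append .<extension> to the full source file name."""
    return f"{source_file}.{extension}"


print(output_file_name("Hello.java", "html"))
```

Note that the original extension is kept, so Hello.java becomes Hello.java.html rather than Hello.html.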
   
IMPORTANT: you must choose one of the above two invocation modes: either
you use -i (--input), -o (--output) (possibly replacing
them with standard input/output redirection), or you specify one or many
files without -i (--input); if you try to mix them you'll get an
error:
     source-highlight -o main.html main.cpp
     Please, use one of the two syntaxes for invocation:
     source-highlight [OPTIONS]... -i input_file -o output_file
     source-highlight [OPTIONS]... [FILES]...
If the string STDOUT is passed as the -o (--output) option, then
the output is forced to the standard output anyway.
   
If -s (--src-lang) is not specified, the source language is
inferred by the extension of the input file or from the file name itself
(possibly using also lower case versions); this, of course, does not
work with standard input redirection.  For further details, see How the input language is discovered.
   
If -f (--out-format) is not specified, the output will be
produced in HTML.
   
If --style-file is not specified, the default.style, which
is included in the distribution, will be used (see Output format style
for further information).
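When source-highlight is driven from a script, the first invocation mode can be assembled as an argument list. The following Python sketch uses only the options described above; the helper name build_command and its parameters are assumptions. Options left as None are omitted, so the defaults just described (language inferred from the file name, HTML output, default.style) apply.

```python
def build_command(input_file, output_file, src_lang=None,
                  out_format=None, style_file=None):
    """Build an argument list for the -i/-o invocation mode."""
    command = ["source-highlight"]
    if src_lang is not None:
        command += ["--src-lang", src_lang]
    if out_format is not None:
        command += ["--out-format", out_format]
    if style_file is not None:
        command += ["--style-file", style_file]
    command += ["--input", input_file, "--output", output_file]
    return command


cmd = build_command("main.cpp", "main.html", src_lang="cpp")
print(" ".join(cmd))
```

If source-highlight is installed, such a list could be passed to subprocess.run(cmd, check=True); the sketch itself only builds the list and does not run anything.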
The default output format for HTML and XHTML uses fixed width fonts by
inserting all the formatted output between <tt> and </tt>. 
Thus, for instance, the fixed width and not fixed width specifications
(see Output format style) will have no effect: every character
will have fixed width.  If you don't like this default behavior and
would like to have non-fixed fonts by default (as happens, e.g., with
LaTeX output), you can use the file html_notfixed.outlang with
the command line argument --outlang-def.  For XHTML output, the
corresponding file is xhtml_notfixed.outlang.
   
Furthermore, the file htmltable.outlang can be used to generate HTML output enclosed in an HTML table (which will use also a background color if specified in the style file). The file xhtmltable.outlang does the same but for XHTML output.
When using LaTeX output format you can choose between monochromatic
output (by using -f latex) or colored output (by using -f
latexcolor).  When using colored output, you need the
color package (again this should be present in your system). 
Of course, you are free to define your own LaTeX output format,
see Output Language Definitions.
When using the Texinfo output format, you may want to use a dedicated
style file, texinfo.style, which comes with the source-highlight
distribution, with the option --style-file.  For instance, the
example in Examples is formatted with this style file.
DocBook output is generated using the <programlisting> tag.  If
the --doc command line option is given, an <article>
document is generated.
If you're using the esc output format (ANSI color escape sequences), for instance together with
less (see Using source-highlight with less), you may
want to use the esc.style, which comes with the source-highlight
distribution, with the option --style-file.  This should
result in a more pleasant coloring output.
During execution, source-highlight needs some files where it finds
directives on how to recognize the source language (if not  specified
explicitly with --src-lang or --lang-def), on which output
format to use (if not specified explicitly with --out-format or
--outlang-def), on how to format specific source elements (e.g.,
keywords, comments, etc.), and source and output language definitions. 
These files will be explained in the next sections.
   
If the directory for such files is not explicitly specified with the
command line option --data-dir, these files are searched for in
the following order:
     
If you want to be sure about which file is used during the
execution, you can use the command line option --verbose.
You must specify your options for syntax highlighting in the file
default.style. 
You can specify formatting options for each element defined
by a language definition file (you can get the list of such elements,
by using --show-lang-elements, see Listing Language Elements).
   
Since version 2.6, you can also specify the background color for the
output document, using the keyword bgcolor (this might be visible
only when the --doc command line option is used).
   
If many elements share the same formatting options, you can specify these elements in the same line, separated by commas.
Here's the default.style that comes with this distribution (this is formatted by using the style.lang that is shown in Tutorials on Language Definitions):
     bgcolor "white"; // the background color for documents
     context gray; // the color for context lines (when specified with line ranges)
     
     keyword blue b ; // for language keywords
     type darkgreen ; // for basic types
     usertype teal ; // for user defined types
     string red f ; // for strings and chars
     regexp orange f ; // for strings and chars
     specialchar pink f ; // for special chars, e.g., \n, \t, \\
     comment brown i, noref; // for comments
     number purple ;       // for literal numbers
     preproc darkblue b ; // for preproc directives (e.g. #include, import)
     symbol darkred ; // for simbols (e.g. <, >, +)
     function black b; // for function calls and declarations
     cbracket red; // for block brackets (e.g. {, })
     todo bg:cyan b;       // for TODO and FIXME
     
     //Predefined variables and functions (for instance glsl)
     predef_var darkblue ;
     predef_func darkblue b ;
     
     // for OOP
     classname teal ; // for class names, e.g., in Java and C++
     
     // line numbers
     linenum black f;
     
     // Internet related
     url blue u, f;
     
     // other elements for ChangeLog and Log files
     date blue b ;
     time, file darkblue b ;
     ip, name darkgreen ;
     
     // for Prolog, Perl...
     variable darkgreen ;
     
     // explicit for Latex
     italics darkgreen i;
     bold darkgreen b;
     underline darkgreen u;
     fixed green f;
     argument darkgreen;
     optionalargument purple;
     math orange;
     bibtex blue;
     
     // for diffs
     oldfile orange;
     newfile darkgreen;
     difflines blue;
     
     // for css
     selector purple;
     property blue;
     value darkgreen i;
     
     // for oz
     atom orange;
     meta i;
     
This file tries to define a style for most elements defined in the language definition files that come with the Source-highlight distribution.
You can specify your own file (it doesn't have to be named
default.style) with the command line option
--style-file, see
Invoking source-highlight.
   
You can also specify the color of normal text by adding this line
normal darkblue ;
As you might see, the syntax of this file is quite straightforward: after
the element (or elements, separated by commas) you can specify the
color, and the background color by using
the prefix bg: (for instance, in the default.style above
the background color is specified for the todo element).
   
Note that the background color might not be available for all output formats: it is available for XHTML and LaTeX but not for HTML.
Then, you can specify further formatting options such as bold, italics, etc.; these are the keywords that can be used:
     b = bold
     i = italics
     u = underline
     f = fixed
     nf = not fixed
     noref = no reference information is generated for these elements
   Since version 2.2, the color specification is not required. For instance, the texinfo.style is as follows (we avoid colors for Texinfo outputs):
     keyword, type b ;
     variable f, i ;
     string f ;
     regexp f ;
     comment nf, i, noref ;
     preproc b ;
     
     // line numbers
     linenum f;
     
     // Internet related
     url f;
     
     // for diffs
     oldfile, newfile i;
     difflines b;
     
     // for css
     selector, property b;
     value i;
   You may also specify more than one of these options, separated by commas, e.g.,
     keyword blue u, b ;
Please keep in mind that in this case the order of the specified
options is kept during the generation of the output; for instance,
depending on the specific output format, the sequences u, b and
b, u may lead to different results.  In particular, the style
that comes first is applied after the ones that follow, so it ends up
outermost.  For instance, in
the case of HTML, the sequence u, b will lead to the following
formatting: <u><b>...</b></u>.
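To make the ordering rule concrete, here is a minimal Python sketch. It is not source-highlight's code: the function name wrap_html and the option-to-tag table (in particular mapping f to tt) are assumptions made only for illustration.

```python
# Hypothetical table mapping style-file options to HTML tag names.
# The "f" -> "tt" entry is an assumption, not taken from source-highlight.
HTML_TAGS = {"b": "b", "i": "i", "u": "u", "f": "tt"}

def wrap_html(text, options):
    """Wrap text so that the first listed option becomes the outermost tag."""
    # Apply the last option first; the first option is applied last
    # and therefore encloses all the others.
    for opt in reversed(options):
        tag = HTML_TAGS[opt]
        text = f"<{tag}>{text}</{tag}>"
    return text

print(wrap_html("...", ["u", "b"]))  # <u><b>...</b></u>
print(wrap_html("...", ["b", "u"]))  # <b><u>...</u></b>
```

The two calls differ only in the order of the options, which is exactly the difference between writing u, b and b, u in a style file.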
   
The noref option specifies that for this element reference
information is not generated (see Generating References).  For
instance, this is used for the comment element, since we do not
want elements in a comment to be searched for cross-references.
   
These are all the possible logical color names handled by source-highlight:
     black
     red
     darkred
     brown
     yellow
     cyan
     blue
     pink
     purple
     orange
     brightorange
     green
     brightgreen
     darkgreen
     teal
     gray
     darkblue
   You can also use the direct color scheme for the specific output format,
by using double quotes, such as, e.g., "#00FF00" in
HTML, or even string colors in double quotes, such as "lightblue".  Of course, the double quotes will be
discarded during the generation.
   
For instance, this is the syslog.style used in the tests directory. This uses direct color schemes.
     date, keyword yellow b ;
     time "#9999FF" ;
     ip "lightblue" b ;
     
     type cyan b ;
     string "brown" b ;
     comment teal ;
     number red ;
     preproc cyan ;
     symbol green ;
     function "#CC66CC" b ;
     cbracket green b ;
     twonumbers green b ;
     port green b ;
     webmethod teal ;
     
     // foo option
     foo red b ; // foo entry
     
     
   Note that, if you use direct color schemes, source-highlight will
perform no transformation, and will output exactly the color scheme you
specified.  For instance, the specification "brown" is different
from brown: the former will be output as it is, while the latter
will be translated into the corresponding color of the output format (for
HTML the visible result is likely to be the same).
   
It is up to you to specify a color scheme string that is handled by the
specific output format.  Thus, direct color schemes might not be
portable in different output formats; for instance, "#00FF00" is
valid in HTML but not in LaTeX.
Since version 2.6 you can also specify the output format style using
a limited CSS syntax.  Please note that this has nothing to do
with the output produced by source-highlight using the --css option.
   
By using a CSS file as the style file (i.e., passing it to the
--style-css-file command line option) you only specify the
output format style, using the same syntax as CSS.  This means that you
can use CSS syntax for specifying the output format style
independently from the actual output (this is what the output format
style is for).  Thus, you can use a CSS file as the output format style
also for LaTeX output (just like you would do with a source-highlight
output format style, see Output format style).
   
This feature is provided basically for code re-use: you can specify the
output format style using a css file, and then re-use the same css file
as the actual style sheet of other HTML pages (or even output files
produced by source-highlight using the --css option).
   
Note that this feature is still quite rudimentary, so only a limited subset
of CSS syntax is recognized.  In particular, selectors are always
interpreted as CSS class selectors, so they must start with a dot. 
/* */ comments are handled.  Properties (and their values) not
handled by source-highlight are simply (and silently) discarded.
   
This is an example of CSS specification handled correctly by source-highlight as a style format specification:
     body {
       background-color: <color specification>;
      }
     
     .selector {
       color: <color specification>;
       background-color: <color specification>;
       font-weight: bold; /* this is a comment */
       font-family: monospace;
       font-style: italic;
       text-decoration: underline;
      }
   Finally, this is the default.css that corresponds to default.style presented in Output format style:
     body {  background-color: white;  }
     
     /* the color for context lines (when specified with line ranges) */
     .context {  color: gray; }
     
     .keyword { color: blue; font-weight: bold; }
     .type { color: darkgreen; }
     .usertype, .classname { color: teal; }
     .string { color: red; font-family: monospace; }
     .regexp { color: orange; }
     .specialchar { color: pink; font-family: monospace; }
     .comment { color: brown; font-style: italic; }
     .number { color: purple; }
     .preproc { color: darkblue; font-weight: bold; }
     .symbol { color: darkred; }
     .function { color: black; font-weight: bold; }
     .cbracket { color: red; }
     .todo { font-weight: bold; background-color: cyan; }
     
     /* line numbers */
     .linenum { color: black; font-family: monospace; }
     
     /* Internet related */
     .url { color: blue; text-decoration: underline; font-family: monospace; }
     
     /* other elements for ChangeLog and Log files */
     .date { color: blue; font-weight: bold; }
     .time, .file { color: darkblue; font-weight: bold; }
     .ip, .name { color: darkgreen; }
     
     /* for Prolog, Perl */
     .variable { color: darkgreen; }
     .italics { color: darkgreen; font-style: italic; }
     .bold { color: darkgreen; font-weight: bold; }
     
     /* for LaTeX */
     .underline { color: darkgreen; text-decoration: underline; }
     .fixed { color: green; font-family: monospace; }
     .argument, .optionalargument { color: darkgreen; }
     .math { color: orange; }
     .bibtex { color: blue; }
     
     /* for diffs */
     .oldfile { color: orange; }
     .newfile { color: darkgreen; }
     .difflines { color: blue; }
     
     /* for css */
     .selector { color: purple; }
     .property { color: blue; }
     .value { color: darkgreen; font-style: italic; }
     
     /* for Oz */
     .atom { color: orange; }
     .meta { font-style: italic; }
     
   If you pass this file to the --style-css-file command line option
and you produce an output file, you will get the same result as using
default.style.
   
Source-highlight comes with a lot of CSS files that can be used either
as standard CSS files for HTML documents, or as style files to pass to
--style-css-file.  In the documentation installation directory
(see Installation) you will find the file
style_examples.html which shows many output examples, each one
with a different CSS style.
This file (the default file is
style.defaults) lists the default style for a language element
whose output style is not specified in the style file; in particular the
following line (comment lines start with #):
     elem1 = elem2
tells that, if the style for an element, say elem1, is not specified in the style file, then elem1 will have the same style as elem2.
For instance, this is the style.defaults that comes with Source-highlight:
     # defaults for styles
     # the format is:
     # elem1 = elem2
     # meaning that if the style for elem1 is not specified,
     # then it will have the same style as elem2
     
     classname = normal
     usertype = normal
     preproc = keyword
     section = function
     paren = cbracket
     attribute = type
     value = string
     predef_var = type
     predef_func = function
     atom = regexp
     meta = function
     
   In this case the style for the element preproc will default to
the style of the element keyword.
   
This file is useful when you want to create your own style file and you
don't want to specify styles for all the elements that will have the
same output style in your style (e.g., the default style formats
preproc elements differently from keywords, but if in your style
you don't specify a style for it, a preproc element will still be
formatted as a keyword).
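The fallback rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names parse_defaults and resolve_style are hypothetical, the styles dictionary stands in for a parsed style file, and following a chain of several defaults is our assumption, not documented behavior.

```python
def parse_defaults(text):
    """Parse 'elem1 = elem2' lines, skipping blank lines and # comments."""
    defaults = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        elem, _, fallback = line.partition("=")
        defaults[elem.strip()] = fallback.strip()
    return defaults

def resolve_style(element, styles, defaults):
    """Return the style of element, falling back to its default element."""
    seen = set()  # guards against cyclic definitions such as a = b, b = a
    while element not in styles and element in defaults and element not in seen:
        seen.add(element)
        element = defaults[element]
    return styles.get(element)  # None if no style can be found

defaults = parse_defaults("""
# defaults for styles
preproc = keyword
predef_func = function
""")
# Stand-in for a user style file that styles only keyword and function.
styles = {"keyword": "blue b", "function": "black b"}

print(resolve_style("preproc", styles, defaults))      # blue b
print(resolve_style("predef_func", styles, defaults))  # black b
```

As soon as the style file specifies preproc itself, that entry is found first and the default is never consulted.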
This configuration file associates a file extension to a specific
language definition file.  You can also use such a file extension to
specify the --src-lang option (see Simple Usage). 
Source-highlight comes with such a file, called lang.map.
   
Of course, you can override the settings of this file by writing your
own language map file and specifying it with the command line option
--lang-map.  Moreover, as explained above, if a file
lang.map is present in the current directory, that version will
be used.  The format of this file is quite simple (comment lines start
with #):
     extension = language definition file
The default language definition file is shown in Introduction.
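As an illustration of this format, the following Python sketch parses a language map and looks up the language definition file for an input file. The function names and the three map entries are hypothetical (this is not the distributed lang.map), and the fallback to the whole file name and to lowercase is modeled on the lookup rule this manual describes for input files without an extension.

```python
import os

def parse_lang_map(text):
    """Parse 'extension = language definition file' lines; # starts a comment line."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, lang_file = line.partition("=")
        mapping[key.strip()] = lang_file.strip()
    return mapping

def lang_file_for(filename, mapping):
    """Look up by extension first, then by whole file name, also trying lowercase."""
    base = os.path.basename(filename)
    ext = os.path.splitext(base)[1].lstrip(".")
    for key in (ext, ext.lower(), base, base.lower()):
        if key and key in mapping:
            return mapping[key]
    return None  # nothing found: the language would have to be inferred

lang_map = parse_lang_map("""
# hypothetical excerpt, not the distributed lang.map
java = java.lang
pl = prolog.lang
changelog = changelog.lang
""")

print(lang_file_for("Hello.java", lang_map))  # java.lang
print(lang_file_for("ChangeLog", lang_map))   # changelog.lang
```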
These files are crucial for source-highlight since they specify the source elements that have to be highlighted. These files also allow you to specify your own language definitions in order to deal with a language that is not handled by source-highlight. The syntax for these files is explained in Language Definitions.
This configuration file associates an output format to a specific output
language definition file.  You can use the name of that output format to
specify the --out-format option (see Simple Usage). 
Source-highlight comes with such a file, called outlang.map.
   
Of course, you can override the settings of this file by
writing your own output language map file and specifying it
with the command line option --outlang-map. 
Moreover, as explained above, if a file outlang.map
is present in the current directory, that version will be used. 
The format of this file is quite simple:
     output format name = language definition file
The default language definition file is shown in Introduction.
In particular, there is a convention for the output format name in the
output language map: the one with the -css suffix is the one used
when the --css command line option is given.
These files are crucial for source-highlight since they specify how the source elements are highlighted. These files also allow you to specify your own output format definitions in order to deal with an output format that is not handled by source-highlight. The syntax for these files is explained in Output Language Definitions.
These files are part of source-highlight distribution, but they can also be downloaded, independently, from here:
http://www.gnu.org/software/src-highlite/outlang_files/
I encourage those who write new language definitions or correct/modify existing language definitions to send them to me so that they can be added to the source-highlight distribution!
Since these files require more explanations (that, however, are not necessary to the standard usage of source-highlight), they are carefully explained in separate parts: Language Definitions and Output Language Definitions.
These files are part of source-highlight distribution, but they can also be downloaded, independently, from here:
http://www.gnu.org/software/src-highlite/lang_files/
The format for running the source-highlight program is:
     source-highlight option ...
source-highlight supports the following options, shown by
the output of source-highlight --detailed-help:
     source-highlight
     
     Highlight the syntax of a source file (e.g. Java) into a specific format (e.g.
     HTML)
     
     Usage: source-highlight [OPTIONS]... < input_file > output_file
            source-highlight [OPTIONS]... -i input_file -o output_file
            source-highlight [OPTIONS]... [FILES]...
     
       -h, --help                    Print help and exit
           --detailed-help           Print help, including all details and hidden
                                       options, and exit
       -V, --version                 Print version and exit
       -i, --input=filename          input file. default std input
       -o, --output=filename         output file. default std output. If STDOUT is
                                       specified, the output is directed to standard
                                       output
       -s, --src-lang=STRING         source language (use --lang-list to get the
                                       complete list).  If not specified, the source
                                       language will be guessed from the file
                                       extension.
           --lang-list               list all the supported language and associated
                                       language definition file
           --outlang-list            list all the supported output language and
                                       associated language definition file
       -f, --out-format=STRING       output format (use --outlang-list to get the
                                       complete list)  (default=`html')
       -d, --doc                     create an output file that can be used as a
                                       stand alone document (e.g., not to be
                                       included in another one)
           --no-doc                  cancel the --doc option even if it is implied
                                       (e.g., when css is given)
       -c, --css=filename            the external style sheet filename.  Implies
                                       --doc
       -T, --title=STRING            give a title to the output document.  Implies
                                       --doc
       -t, --tab=INT                 specify tab length.  (default=`8')
       -H, --header=filename         file to insert as header
       -F, --footer=filename         file to insert as footer
           --style-file=filename     specify the file containing format options
                                       (default=`default.style')
           --style-css-file=filename specify the file containing format options (in
                                       css syntax)
           --style-defaults=filename specify the file containing defaults for format
                                       options  (default=`style.defaults')
           --outlang-def=filename    output language definition file
           --outlang-map=filename    output language map file
                                       (default=`outlang.map')
           --data-dir=path           directory where language definition files and
                                       language maps are searched for.  If not
                                       specified these files are searched for in the
                                       current directory and in the data dir
                                       installation directory
           --output-dir=path         output directory
           --lang-def=filename       language definition file
           --lang-map=filename       language map file  (default=`lang.map')
           --show-lang-elements=filename
                                     prints the language elements that are defined
                                       in the language definition file
           --infer-lang              force to infer source script language
                                       (overriding given language specification)
     
     Lines:
       -n, --line-number[=padding]   number all output lines, using the specified
                                       padding character  (default=`0')
           --line-number-ref[=prefix]
                                     number all output lines and generate an anchor,
                                       made of the specified prefix + the line
                                       number  (default=`line')
     
     Filtering output:
     
      Mode: linerange
       specifying line ranges
           --line-range=STRING       generate only the lines in the specified
                                       range(s)
       each range can be of the shape:
       	single line (e.g., --line-range=50)
       	full range (e.g., --line-range=2-10)
       	partial range (e.g., --line-range=-30, first 30 lines,
       	--line-range=40- from line 40 to the end
     
           --range-separator=STRING  the optional separator to be printed among
                                       ranges (e.g., "...")
           --range-context=INT       number of (context) lines generated even if not
                                       in range
       The optional --range-context specifies the number of lines that are not in
       	range that will be printed anyway (before and after the lines in range);
       	These lines will be formatted according to the "context" style.
     
     
      Mode: regexrange
       specifying regular expression delimited ranges
           --regex-range=STRING      generate only the lines within the specified
                                       regular expressions
       when a line containing the specified regular expression is found, then
       the lines after this one are actually generated, until another line,
       containing the same regular expression is found (this last line is not
       generated).
       More than one regular expression can be specified.
     
     reference generation:
           --gen-references=STRING   generate references  (possible
                                       values="inline", "postline", "postdoc"
                                       default=`inline')
           --ctags-file=filename     specify the file generated by ctags that will
                                       be used to generate references
                                       (default=`tags')
           --ctags=cmd               how to run the ctags command.  If this option
                                       is not specified, ctags will be executed with
                                       the default value.  If it is specified with
                                       an empty string, ctags will not be executed
                                       at all  (default=`ctags --excmd=n
                                       --tag-relative=yes')
     
     testing:
       -v, --verbose                 verbose mode on
       -q, --quiet                   print no progress information
           --binary-output           write output files in binary mode
       This is useful for testing purposes, since you may want to make
       sure that output files are always generated with a final newline character
       only
           --statistics              print some statistics (i.e., elapsed time)
           --gen-version             put source-highlight version in the generated
                                       file  (default=on)
           --check-lang=filename     only check the correctness of a language
                                       definition file
           --check-outlang=filename  only check the correctness of an output
                                       language definition file
           --failsafe                if no language definition is found for the
                                       input, it is simply copied to the output
       -g, --debug-langdef[=type]    debug a language definition.  In dump mode just
                                       dumps all the steps; in interactive, at each
                                       step, waits for some input (press ENTER to
                                       step)  (possible values="interactive",
                                       "dump" default=`dump')
           --show-regex=filename     show the regular expression automaton
                                       corresponding to a language definition file
   Let us explain some options in detail (apart from those that should be
clear from the --help output itself, and those already explained
in Simple Usage).
     
     --data-dir
The data files of source-highlight (language definition files, language
maps, etc.) are installed in the directory
prefix/share/source-highlight, where
prefix is chosen at compilation time (see Installation). 
Thus, source-highlight should be able to find all the files it needs
independently.  However, if you want to override this setting, e.g.,
because you have your own language definition files, or simply because
you installed a source-highlight binary in a different
directory from the one used during the compilation, you can use the
command line option --data-dir.
     --doc, -d
Creates an output file that can be used as a stand-alone document.  If you
do not give a title with --title, the source file name will be used as the title.
     --no-doc
The --doc option above is actually implied by other command line
options (e.g., --css).  If you do not want this (e.g., you want
to include the output in an existing document containing the global
style sheet), you can disable this by using --no-doc.
     --css, -c
     --tab, -t
     --style-file
     --style-css-file
     --style-defaults
     --output-dir
     --infer-lang
These options behave as summarized in the --detailed-help output above.
     --line-number
Numbers all the output lines, using the specified padding character
(default is 0).
     --line-number-ref
Like --line-number, this option numbers all the output lines, and,
additionally, generates an anchor for each line.  The anchor consists of
the specified prefix (default is line) and the line number (e.g.,
line25).  For instance, as prefix, if you deal with many files,
you can use the file name.  Note that some output languages might not
support this feature (e.g., esc, since it makes no sense in such a
case).  See Anchors and References for defining how to generate an
anchor in a specific output language.
     --line-range
     --range-context
     --range-separator
With --line-range you can generate only the lines in the specified
ranges.  For instance, if you specify
     --line-range="-5","10","20-25","50-"
only the following lines will be output: the first 5 lines, line 10, lines 20 to 25 and from line 50 to the end of input (see also the examples in Line ranges).
Together with --line-range, you can also specify
--range-context: this is the number of lines that will be printed
before and after the lines of a range (i.e., the surrounding
“context”).  These lines will not be highlighted: they will be printed
according to the style context.  For instance, extending the
previous example,
     --line-range="-5","10","20-25","50-" --range-context=1
the following lines will also be output: 6, 9, 11, 19, 26, 49 (see also the examples in Line ranges (with context)).
Finally, you can specify a range separator line string with
--range-separator that will be printed between ranges (see also
the examples in Line ranges (with context)).  The separator string
is preformatted automatically, so, e.g., you don't have to escape
special output characters, such as the { } in texinfo output.
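The selection of range and context lines can be reproduced with a short Python sketch. It only models which line numbers are chosen and is not the program's implementation; the function names are hypothetical, and the input length of 60 lines is an assumption needed to close the open range 50-.

```python
def parse_range(spec, last_line):
    """Turn '50', '2-10', '-30' or '40-' into an inclusive (first, last) pair."""
    if "-" not in spec:
        return int(spec), int(spec)
    start, end = spec.split("-", 1)
    return (int(start) if start else 1, int(end) if end else last_line)

def classify_lines(specs, context, last_line):
    """Return (lines in range, context-only lines) as sorted lists."""
    ranges = [parse_range(s, last_line) for s in specs]
    in_range = {n for first, last in ranges for n in range(first, last + 1)}
    # Lines within `context` lines of any range, clipped to the input.
    near = {n for first, last in ranges
            for n in range(first - context, last + context + 1)
            if 1 <= n <= last_line}
    return sorted(in_range), sorted(near - in_range)

# Assumed input length: 60 lines.
selected, ctx = classify_lines(["-5", "10", "20-25", "50-"], context=1, last_line=60)
print(ctx)  # [6, 9, 11, 19, 26, 49]
```

The context-only lines computed here are the ones that would be printed with the context style instead of being highlighted.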
     
     --regex-range
Ranges can also be delimited by regular expressions, using the command
line option --regex-range.  In this case the beginning of the
range is detected by a line containing (at any point) a string
matching the specified regular expression; the end is detected by a
line containing a string matching the same regular expression that
started the range.  This feature is very useful when we want to document
some code (e.g., in this very manual) by showing only specific parts
that are delimited in an ad-hoc way in the source code (e.g., with
specific comment patterns).  You can see some usage examples
in Regex ranges.
     The specified strings (this option accepts multiple occurrences) must be valid regular expressions (thus you must escape special characters accordingly), otherwise you will get an error.
Furthermore, --line-range and --regex-range cannot coexist
on the same command line.
     
     --failsafe
When using --failsafe, if no input language can be established,
source-highlight will use the input language definition file
default.lang, which is an empty file.  You might want to
customize this file, though.
     
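     --debug-langdef
     --show-regex
These options debug a language definition and show the regular expression
automaton corresponding to a language definition file, respectively.
   
The other command line options dealing with references are explained in more detail in Generating References.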
As already explained in Simple Usage, source-highlight uses a
language definition file according to the language specified with the
option --src-lang, or --lang-def, or by using the input
file extension.
   
Since version 2.5, source-highlight can use an inference mechanism to deduce the input language. For the moment, it can detect script languages based on the “sha-bang” mechanism, i.e., when the first line of a script is, e.g.,
     #!/bin/sh
It also detects script languages specified by using the env
program:
     #!/usr/bin/env perl
Finally, it also recognizes the Emacs convention, of declaring the Emacs
major mode using the format -*- lang -*-.
   
For instance, a script starting as the following one:
     #!/bin/bash
     # -*- Tcl -*-
   will be interpreted as a Tcl script, and not as a bash script.
This inference mechanism is performed, by default, in case the input language is neither explicitly specified nor found in the language map file by using the input file extension or the filename itself, possibly also the lowercase version (the input file may also have no extension at all, but, for instance, a ChangeLog input file will be highlighted using changelog.lang).
Furthermore, this mechanism can be given priority with the command line
option --infer-lang.  For instance, this is used in the script
src-hilite-lesspipe.sh (Using source-highlight with less)
when running source-highlight, in order to avoid the problem of
formatting a Perl script as a Prolog program (since the extension
.pl is associated to Prolog programs in the language map file,
Perl).
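The three detection rules can be sketched as follows. This is a simplified model, not the program's own code: the function name infer_language, the two regular expressions, the choice of inspecting only the first two lines, and the lowercasing of the Emacs mode name are all assumptions.

```python
import re

# '#!' followed by the interpreter path and an optional first argument.
SHEBANG = re.compile(r"^#!\s*(\S+)(?:\s+(\S+))?")
# Emacs major mode declaration of the form -*- lang -*-.
EMACS_MODE = re.compile(r"-\*-\s*(\S+?)\s*-\*-")

def infer_language(lines):
    """Guess a script language from its first lines, or return None."""
    head = lines[:2]  # assumption: only the first two lines are inspected
    # The Emacs mode declaration takes precedence over the sha-bang line.
    for line in head:
        match = EMACS_MODE.search(line)
        if match:
            return match.group(1).lower()
    match = SHEBANG.match(head[0]) if head else None
    if not match:
        return None
    interpreter = match.group(1).rsplit("/", 1)[-1]
    # '#!/usr/bin/env perl': the language is the argument given to env.
    if interpreter == "env" and match.group(2):
        return match.group(2)
    return interpreter

print(infer_language(["#!/bin/sh"]))                     # sh
print(infer_language(["#!/usr/bin/env perl"]))           # perl
print(infer_language(["#!/bin/bash", "# -*- Tcl -*-"]))  # tcl
```

With --infer-lang, a result of this kind would be used even when the file extension already maps to a language, which is how a Perl script named with the .pl extension avoids being treated as Prolog.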
Since version 2.0 source-highlight uses a specific syntax to specify source language elements (e.g., keywords, strings, comments, etc.). Before version 2.0, language elements were scanned through Flex. This had the drawback of requiring a new Flex file to be written to deal with a new language; even worse, a new language could not be added “dynamically”: you had to recompile the whole source-highlight program.
Instead, now, language elements are specified in a file, loaded dynamically, through a (hopefully) simple syntax. Then, these definitions are used internally to create, on-the-fly, regular expressions that are used to highlight the elements (see also How source-highlight works). In particular, we use the regular expressions provided by the Boost library (see Installation). Thus, when writing a language definition file you will surely have to deal with regular expressions. Don't be scared: for most of the languages you may never have to deal with difficult regular expressions, and you can also specify language keywords (such as, e.g., “if”, “while”, etc., see Simple definitions); moreover, for defining delimited language elements you will not have to write a regular expression, but just the delimiters (see Delimited definitions). However, there might be some language definitions that may require heavy use of more involved regular expressions (e.g., Perl, just to mention one).
Of course, we use the Boost regex library regular expression syntax. We refer to the Boost documentation for this syntax (http://www.boost.org/libs/regex/doc/syntax.html); however, in Notes on regular expressions, we provide some notes that might be helpful for those who have never dealt with regular expressions. By default, the Boost regex library uses Perl regular expression syntax, and, at the moment, this is the only syntax supported by source-highlight.
Here, we see such syntax in detail, by relying on many examples. This allows a user to easily modify an existing language definition and create a new one. These files typically have the extension .lang.
Each definition basically associates a regular expression to a language
element and defines a name for the language element.  This name will be
used to associate a particular style (e.g., bold face, color, etc.) when
highlighting such elements.  You cannot use names that are the same
as keywords used in the language definition syntax (e.g., start,
as shown later, is a reserved word).
   
Comments can be given by using #; the rest of the line is
considered as a comment.
   
Source-highlight will scan each line of the input file separately. So a regular expression that tries to match new line characters is destined to fail. However, the language definition syntax provides means to deal with multiple lines (see Delimited definitions and State/Environment Definitions).
Before getting into the details of the language definition syntax, it is crucial to describe the 3 possible ways of specifying a regular expression string. These 3 ways basically differ in how they handle regular expression special characters, such as, e.g., parentheses. For this reason, one mechanism can be more powerful than another one, but it could also require more attention; furthermore, there can be situations where you're forced to use only one mechanism, since the other ones cannot accomplish the required goal.
     "expression"
If you use double quotes (" and not `` or
'') to specify a regular expression, then basically all the
characters, except the alternation symbol, i.e., the pipe symbol |,
are considered literally, and thus will be automatically escaped (e.g.,
a dot . is interpreted as the character . not as the
regular expression wild card).  Thus, for instance, if you specify
               "my(regular)ex.pre$$ion{*}"
     source-highlight will automatically transform it into
          my\(regular\)ex\.pre\$\$ion\{\*\}
     The special character |, unless it is meant to separate two
alternatives (Simple definitions), must be escaped with the
character \, e.g., \|.  Also the character \,
if it is intended literally, must be escaped, e.g., \\.
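The escaping applied to double quoted expressions can be imitated in Python as shown below. This is a sketch of the rule just described, not the program's own code; the function name and the exact set of characters treated as special are assumptions.

```python
import re

def escape_double_quoted(expr):
    r"""Escape regex metacharacters, leaving | and existing \-escapes alone."""
    # First alternative: an already escaped character such as \| or \\ (kept).
    # Second alternative: a single special character (gets a backslash).
    pattern = re.compile(r"(\\.)|([.^$*+?()\[\]{}])")
    return pattern.sub(lambda m: m.group(1) or "\\" + m.group(2), expr)

print(escape_double_quoted("my(regular)ex.pre$$ion{*}"))
# my\(regular\)ex\.pre\$\$ion\{\*\}
```

An expression such as "foo|lang" passes through unchanged, so the pipe keeps its role as the separator between alternatives.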
     
     'expression'
If you want to use the special characters of regular expressions, you
need to use single quotes ('), instead of double quoted
strings.  This way, you can specify special characters with their
intended meaning.
     However, marked subexpressions are automatically transformed in non
marked subexpressions, i.e., the parts in the expression of the shape
(...) will be transformed into (?:...)  (as explained in
Notes on regular expressions, (?:...)  lexically groups
part of a regular expression, without generating a marked
sub-expression).
     
Thus, for instance, if you specify
     'my(regular)ex.pre$ion*'
source-highlight will automatically transform it into
     my(?:regular)ex.pre$ion*
Since marked subexpressions cannot be specified with this syntax, backreferences (see Notes on regular expressions) are not allowed.
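The rewriting of marked subexpressions can be approximated with the following Python sketch. The function name is hypothetical and the approach is deliberately naive: for example, a parenthesis inside a character class such as [(] would also be rewritten, which a real implementation would have to avoid.

```python
import re

def unmark_groups(expr):
    r"""Rewrite each unescaped '(' that opens a marked group as '(?:'."""
    # (\\.) keeps escaped characters such as \( untouched;
    # \((?!\?) matches an opening parenthesis not already followed by '?'.
    pattern = re.compile(r"(\\.)|\((?!\?)")
    return pattern.sub(lambda m: m.group(1) or "(?:", expr)

converted = unmark_groups("my(regular)ex.pre$ion*")
print(converted)                     # my(?:regular)ex.pre$ion*
print(re.compile(converted).groups)  # 0
```

The group count of 0 shows why a backreference such as \1 has nothing to refer to in an expression written with single quotes.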
     `expression`
This syntax (note the use of the backtick `, while the previous one uses
') for specifying a regular expression was introduced to overcome
the limitations of the other two syntaxes.  With this syntax, the marked
subexpressions are not transformed, and so you can use regular
expression mechanisms that rely on marked subexpressions, such as
backreferences and conditionals (see Notes on regular expressions).
     This syntax is also crucial for highlighting specific program parts of some programming languages, such as Perl regular expressions (e.g., in substitution expressions), which can be written in many forms: in particular, the separator between the part to be replaced and the replacement can be any non-alphanumeric character, for instance,
          s/foo/bar/g
          s|foo|bar|g
          s#foo#bar#g
          s@foo@bar@g
     Using this syntax, and backreferences, we can easily define a single language element to deal with these expressions (without specifying all the cases for each possible non alphanumerical character):
regexp = `s([^[:alnum:][:blank:]]).*\1.*\1[ixsmogce]*`
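To see the backreference at work outside source-highlight, the same expression can be tried in Python. This is a sketch under assumptions: Python's re module stands in for the Boost regex library, and since it has no POSIX classes, [^[:alnum:][:blank:]] is spelled out for ASCII input.

```python
import re

# s, then any separator that is neither alphanumeric nor blank (group 1),
# then the same separator twice more via \1, then optional modifiers.
SUBST = re.compile(r"s([^A-Za-z0-9 \t]).*\1.*\1[ixsmogce]*")

samples = ["s/foo/bar/g", "s|foo|bar|g", "s#foo#bar#g", "s@foo@bar@g"]
for text in samples:
    m = SUBST.fullmatch(text)
    print(text, "separator:", m.group(1) if m else None)
```

An input such as s/foo|bar/g is rejected, because the separator chosen by the first group (/) does not occur three times.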
Since version 2.11, in all kinds of regular expression specification, you can insert newline characters, which will simply be ignored. Thus, e.g., the file:
     # test_newlines.lang
     # test that newlines in expressions are simply discarded
     
     keyword = "foo
     |
     lang"
     
     (keyword,normal,classname) =
       `(\<struct)
     ([[:blank:]]+)
     ([[:alnum:]_]+)`
     
     preproc = '^[[:blank:]]*
     #([[:blank:]]*
     [[:word:]]*)'
   and the file:
     # test_nonewlines.lang
     # test that newlines in expressions are simply discarded
     # see the corresponding test_newlines.lang
     
     keyword = "foo|lang"
     
     (keyword,normal,classname) = `(\<struct)([[:blank:]]+)([[:alnum:]_]+)`
     
     preproc = '^[[:blank:]]*#([[:blank:]]*[[:word:]]*)'
   are equivalent. However, the former is surely more readable.
Note however, that space characters are NOT ignored in regular expression definitions.
The simplest way to specify language elements is to list the possible alternatives. This is the case, for instance, for keywords. In java.lang you have:
     keyword = "abstract|assert|break|case|catch|class|const",
               "continue|default|do|else|extends|false|final",
               "finally|for|goto|if|implements|instanceof|interface"
     keyword = "native|new|null|private|protected|public|return",
               "static|strictfp|super|switch|synchronized|throw",
               "throws|true|this|transient|try|volatile|while"
   You can separate quoted definitions with commas.  Alternatively, within
a quoted definition, alternatives can be separated with the pipe symbol
|.  The above definition defines the language element
keyword.  Each time an element is found in the source file, it is
highlighted with the style for the element with the same name in the
output format style file (all the elements shown in these examples
are taken from the language definition files that come with
source-highlight, and there is a style for each of them, see
Configuration files).  If such an element is not specified in the
output format style file, it is highlighted with the style normal (see
Configuration files), so pay attention to typos :-)
   
From the above example you may have noted that language element
definitions are cumulative, so the second keyword definition does
not replace the first one.   (Indeed, in some cases you may want to
actually redefine a language element; this is possible as explained in
Redefinitions and Substitutions).
   
Note that words specified in double quotes have to match exactly in a
source file, and they must be isolated (not surrounded by anything but
spaces).  Thus for instance class is matched as a keyword, but in
my_class the substring class is not matched as keyword. 
From the point of view of regular expressions a string such as
class in a double quote simple definition is intended as
\<(class)\>.
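The difference between an isolated word and a plain substring can be checked with a short Python sketch. It assumes that Python's \b (word boundary) is an acceptable stand-in for the word-start and word-end asserts \< and \>; the helper name as_word is hypothetical.

```python
import re


def as_word(word: str) -> str:
    """Approximate the meaning of a double quoted simple definition."""
    # re.escape keeps the word literal; \b requires word boundaries.
    return r"\b(" + re.escape(word) + r")\b"


pattern = as_word("class")
print(re.search(pattern, "public class Foo") is not None)  # isolated word
print(re.search(pattern, "my_class") is not None)          # part of a longer word
print(re.search("class", "my_class") is not None)          # plain substring search
```

The last line corresponds to a definition that does not require the word to be isolated.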
   
Special characters have to be escaped with the character \.  So
for instance if you want to specify the character |, which is
normally used to separate alternatives in double quoted strings, you
have to specify \|.
   
As explained in Ways of specifying regular expressions,
definitions in double quotes are interpreted literally (thus, e.g., a
dot . is interpreted as the character . not as the regular
expression wild card).   If you want to enjoy the full power of regular
expressions to specify a language alternative, you have to use single
quoted strings ('), instead of double quoted strings, or strings
quoted with backticks (`).
   
For instance, the following is the definition for a preprocessor directive in C/C++:
preproc = '^[[:blank:]]*#([[:blank:]]*[[:word:]]*)'
Note that the definition 'class' is different from
"class", as explained above.  Thus, for instance 'class'
matches also the sub-expression class inside my_class.
   
Furthermore, you are not allowed to specify, in the same list, double quoted strings and single quoted strings: you need to split such list definitions. Thus, for instance, the following definition is wrong:
preproc = "#define",'^[[:blank:]]*#([[:blank:]]*[[:word:]]*)'
while the following one is correct:
     preproc = "#define"
     preproc = '^[[:blank:]]*#([[:blank:]]*[[:word:]]*)'
   Finally, at the end of a list of definitions, one can specify the
keyword nonsensitive; in that case, the specified strings will be
interpreted in a case-insensitive way.  For instance, we use this
feature in the Pascal language definition, pascal.lang, where keywords
are parsed in a case-insensitive way:
     keyword = "alfa|and|array|begin|case|const|div",
           "do|downto|else|end|false|file|for|function|get|goto|if|in",
           "label|mod|new|not|of|or|pack|packed|page|program",
           "put|procedure|read|readln|record|repeat|reset|rewrite|set",
           "text|then|to|true|type|unpack|until|var|while|with|writeln|write"
       nonsensitive
   
It is often useful to define a language element that affects all the
remaining characters up to the end of the line.   For such definitions,
instead of the = you must use the keyword start.   For
instance, the following is the definition of a single line comment in
C++:
comment start "//"
This means that when the two characters // are encountered in the
source file, everything from these characters on, up to the end of
the line, will be highlighted according to the style comment.
Note that the order of language definitions matters, since it is used during regular expression matching (this will be detailed in How source-highlight works). You then have to make sure that, if there are definitions that start with the same characters, the longest expression is specified first in the file. For instance if you write
     symbol = "/"
     comment start "//"
   The first expression will always be matched first, and the second expression will never be matched. The right order is
     comment start "//"
     symbol = "/"
   
Many elements are delimited by specific character sequences. For instance, strings and multiline comments. The syntax for such an element definition is
     <name> delim <left delimiter> <right delimiter> \
             {escape <escape character>} \
             {multiline} {nested}
   The escape statement specifies the escape character that may
precede one of the delimiters inside the element.  This is optional.
   
For instance, this is the definition of C-like strings:
string delim "\"" "\"" escape "\\"
Note that \ is a special character in definitions so it has to
be escaped.  If the escape specification were omitted, the C
string "write \"hello\" string" would be highlighted
incorrectly (it would be highlighted as the string
"write \", the normal character sequence hello\ and
the string " string").
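The effect of the escape specification can be reproduced with two ordinary regular expressions in Python. This is only a sketch of the idea: the two patterns below are assumptions made for illustration, not the expressions that source-highlight builds internally.

```python
import re

# The C string from the text, backslashes included.
source = r'"write \"hello\" string"'

# Without an escape character, the first double quote found ends the string.
no_escape = re.compile(r'"[^"]*"')
# With \ as escape character, a backslash and the character after it are skipped.
with_escape = re.compile(r'"(?:\\.|[^"\\])*"')

print(no_escape.findall(source))
print(with_escape.findall(source))
```

The first pattern splits the input into two strings with hello\ left over between them, while the second one recognizes the whole input as one string.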
   
The option multiline specifies that the element can span
multiple lines.  For instance, PHP strings are defined as follows:
string delim "\"" "\"" escape "\\" multiline
The option nested instructs source-highlight to count multiple
occurrences of the delimiters and to match each closing delimiter with
the corresponding opening one (using a stack).  For instance, if we wanted to highlight
C-like multiline comments in a nested way [26], we could use the following definition:
comment delim "/*" "*/" multiline nested
If nested was not used, then the closing */ of the
following nested comment would conclude the comment (and the second
*/ would not be highlighted as a comment):
     /*
        This is a /* nested comment */
     */
   Note that, in order for a delimited language element to be nested, its starting and ending elements must be different; thus, for instance, the following definition is not correct:
string delim "\"" "\"" nested # WRONG!
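The counting performed for nested elements can be sketched in Python with a depth counter. The function comment_end and its parameters are hypothetical names used only for this illustration; the sketch assumes that the text at position start begins with the opening delimiter.

```python
def comment_end(text, start, opener="/*", closer="*/", nested=True):
    """Return the index just past the delimiter that closes the comment
    opened at `start`, or -1 if the comment is never closed."""
    depth = 0
    i = start
    while i < len(text):
        if text.startswith(opener, i) and (nested or depth == 0):
            # Count the opening delimiter (only the first one if not nested).
            depth += 1
            i += len(opener)
        elif text.startswith(closer, i):
            depth -= 1
            i += len(closer)
            if depth == 0:
                return i
        else:
            i += 1
    return -1


text = "/* a /* b */ c */ d"
print(text[:comment_end(text, 0)])
print(text[:comment_end(text, 0, nested=False)])
```

The counter also shows why the two delimiters must differ: with identical delimiters there is no way to tell whether an occurrence opens a new level or closes the current one.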
As said above, definitions are cumulative, even when they use different syntactic forms. Thus, for instance, the complete definition of C++-style comments is the following (actually, the definition of C-style comments is more involved, see the file c_comment.lang):
     comment start "//"
     comment delim "/*" "*/" multiline
   
It is possible to define variables to be re-used in many parts in a language definition file. A variable is defined by using
vardef <name of the variable> = <list of definitions>
   
Once defined, a variable can be used by prepending the
symbol $ to its name.  For instance,
     vardef FUNCTION = '(?:[[:alpha:]]|_)[[:word:]]*(?=[[:blank:]]*\()'
     function = $FUNCTION
   The capital letters are used only for readability.
It is also possible to concatenate variables and expressions, and reuse variables inside further variable definitions:
     vardef basic_time = '[[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}'
     vardef time = '\<' + $basic_time + '\>'
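The concatenation of variables and expressions behaves like plain string concatenation of regular expression fragments. The following Python sketch mirrors the two vardef lines above, assuming [0-9] for [[:digit:]] and \b as a stand-in for \< and \>; the variable names basic_time and time_re are only illustrative.

```python
import re

# vardef basic_time = '[[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}'
basic_time = r"[0-9]{2}:[0-9]{2}:[0-9]{2}"
# vardef time = '\<' + $basic_time + '\>'
time_re = r"\b" + basic_time + r"\b"

line = "started 12:30:05, stopped 123:45:678"
print(re.findall(time_re, line))     # only isolated times
print(re.findall(basic_time, line))  # also digits inside longer numbers
```

The second search also reports 23:45:67, taken from the middle of 123:45:678, which is what the surrounding word asserts prevent.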
   
With dynamic backreferences you can refer to a string matched by
the regular expression of the first element of a delim
specification [27].  I called these
backreferences dynamic in order to distinguish them from the
backreferences of regular expression syntax (see Ways of specifying regular expressions).  This is crucial in cases when the right delimiter
depends on a subexpression matched by the left delimiter; for instance,
Lua comments can be of the shape --[[ comment ]] or --[=[
comment ]=], but neither --[=[ comment ]] nor --[[ comment
]=] (furthermore, they can be nested) [28].  Thus, the regular expression of
the right element depends on the one of the left element.
   
A dynamic backreference is similar to a variable (Variable definitions), but it requires no declaration and has the shape
     @{number}
   where number is the number of the marked subexpression in the
left delimiter (source-highlight will actually check that such a marked
subexpression exists in the left delimiter).
   
For instance, this is the definition of Lua comments (see also lua.lang):
     environment comment delim `--\[(=*)\[` "]" + @{1} + "]"
                 multiline nested begin
       include "url.lang"
       ...
     end
   Note how the left delimiter can match zero or more = characters, as a
marked subexpression, and the right delimiter refers to that with @{1}.
   
Source-highlight will take care of escaping possible special characters
during dynamic backreference substitutions.  For instance, suppose that
you must substitute | for @{1}, because we matched |
with the subexpression [^[:alnum:]] in a delim element like the
following one:
     comment delim `([^[:alnum:]])` @{1}
   Since | is a special character in regular expression syntax
source-highlight will actually replace @{1} with \|.
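The substitution of a dynamic backreference, including the escaping step, can be sketched in Python. The function lua_comment is a hypothetical illustration of the Lua definition shown earlier; it uses re.escape for the escaping and deliberately ignores nesting.

```python
import re


def lua_comment(text):
    """Return the first Lua long comment in `text`, or None."""
    left = re.search(r"--\[(=*)\[", text)
    if left is None:
        return None
    # "]" + @{1} + "]": build the right delimiter from what group 1 matched,
    # escaping it so that it is searched for literally.
    right = re.compile(re.escape("]" + left.group(1) + "]"))
    end = right.search(text, left.end())
    return text[left.start():end.end()] if end else None


print(lua_comment("x --[=[ a ]] b ]=] y"))
print(lua_comment("--[[ a ]=] b ]] c"))
print(re.escape("|"))
```

In the first call the sequence ]] does not close the comment, because the left delimiter matched one = and the right delimiter therefore has to be ]=].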
   
IMPORTANT: the right delimiter can only refer to subexpressions of its left delimiter; thus, in case of nested delim element definitions (e.g., in states or environments, see State/Environment Definitions), the left delimiter acts as a binder and hides possible subexpressions defined in outer delim elements.
This is crucial to correctly match nested delimited elements with backreferences: source-highlight will correctly recognize this nested (and syntactically correct) Lua comment:
     --[[
       first level comment
       --[=[
         second level
          --[[
            third level
         ]]
       ]=]
     ]]
   
It is possible to include other language definition files into another
file.  This inclusion physically inserts the contents of
the included file into the current file during parsing, at the exact
point of inclusion (just like #include in C/C++).   This is
useful for re-using definitions in many files.   For instance, C++
comment definitions are given in a file c_comment.lang, and this
file is included in the Java and C++ definition files.  The same happens
for numbers and functions.  For instance, the file java.lang
contains the following include instructions:
     include "c_comment.lang"
     
     include "number.lang"
     
     keywords ...
     
     include "function.lang"
   Note that the order of inclusion is crucial since the order of
definition is crucial.  If the function definition were included before
the keyword definitions, then the sentence if (exp) would be
highlighted as a function invocation (see Order of definitions and
How source-highlight works).
Sometimes you want some source elements to be highlighted only if they are surrounded by other elements. Source-highlight language definitions also provide this feature, with the following syntax:
     state|environment <standard definition> begin
       <other definitions>
     end
   This structure is recursive (so other state/environment definitions can
be given within a state/environment).   The meaning of a
state/environment is that the definitions within the begin
... end are matched only if the definitions that define the
state/environment have been matched.  When entering a state/environment,
however, the definitions given outside the state/environment are not
matched.  The difference between state and environment is
that in the latter, normal parts of the source language (i.e., those
that do not match any definition) are highlighted according to the style
of the definition that defines the environment.
   
As an example, the following defines the multiline nested C comment, and highlights URL and e-mail addresses only when they appear inside a comment (note that this uses file inclusion):
     environment comment delim "/*" "*/" multiline nested begin
           include "url.lang"
     end
   Note that we used environment because everything else inside a
comment has to be formatted according to the comment style.
   
While for programming language definitions states/environments can be avoided (although they allow to highlight some parts only if inside a specific environment, e.g., URLs inside comments, or documentation tags in Javadoc comments), they are pretty important for highlighting files such as logs and ChangeLog files, since elements have to be highlighted when they appear in a specific position. For instance, for ChangeLog (see changelog.lang), we use a state for highlighting the date, name, e-mail or URL (taken from url.lang):
     state date start '[[:digit:]]{2,4}-?[[:digit:]]{2}-?[[:digit:]]{2}' begin
       include "url.lang"
       name = '([[:word:]]|[[:punct:]])+'
     end
   Note that definitions that appear inside a state/environment have the
same scope as the expressions that define the environment.  While this
makes sense for start and delim definitions, it may make
less sense for simple definitions (i.e., those that simply list all
possible expressions): in fact, in this case, such expressions do not
define a scope.  For such definitions, the semantics of
state/environment is that the state/environment starts after matching one
of the alternatives.  And where will it end?   In this case you must
explicitly exit the environment.   For instance, you can say that
a specific language definition inside a state/environment, when
matched, also exits the environment, by using the keyword exit
(you can also specify the number of states to exit). 
You can even exit all the environments with exitall.   For
instance, the following definition highlights a non-empty string
following a web method:
     vardef non_empty = '[^[:blank:]]+'
     
     state webmethod = "OPTIONS|GET|HEAD|POST|PUT|DELETE",
               "TRACE|CONNECT|PROPFIND|MKCOL|COPY|MOVE|LOCK|UNLOCK" begin
       string = $non_empty exit
     end
   If you ever need such advanced features, you may want to take a look at the log.lang definition file that defines highlighting for several log files (access logs, Apache logs, etc.). Moreover, there might be cases, and the above one is one of them, where explicit subexpressions with names will be enough (see Explicit subexpressions with names).
We conclude this section with an interesting example: comments in
M4 files can start with the dnl keyword (up to the end of line),
e.g.,
dnl @synopsis AC_CTAGS_FLAGS
Now if we want to highlight the dnl as a keyword, and the rest of
line as a comment, we cannot simply rely on an environment, since this
would highlight all the line with the same style.  Moreover, we want
to highlight elements starting with @ differently, so we actually
need a state (this would allow us also to highlight urls inside a comment
just like in C++ comments in the example above).  Thus, we need
to simulate an environment with a state, and we do this for M4 as follows
(see the file m4.lang):
     state keyword start "dnl" begin
       # avoid spaces in front of urls or @[[:alpha:]]+ be captured as prefixes
       comment = '[[:blank:]]+'
       include "url.lang"
       include "html.lang"
       type = '@[[:alpha:]]+'
       # everything else is a comment
       comment = '.+'
     end
   Once the state is entered, every sequence of blank characters is highlighted as
a comment; then we have rules for URLs and @ elements; then everything
else (.+) is highlighted as a comment.
   
One might think that a smarter way would be to simply have the following definition (after all, why bother highlighting spaces as comments?):
     state keyword start "dnl" begin
       include "url.lang"
       include "html.lang"
       type = '@[[:alpha:]]+'
       comment = '.+'
     end
   Well, with this definition spaces in front of matched URLs or @ elements would be highlighted as normal, being considered as prefixes. This is due to how source-highlight searches for matching rules; we refer to How source-highlight works for further details.
Often, you need to specify two program elements in the same regular expression, because they are tightly related, but you also need to highlight them differently.
For instance, you might want to highlight the name of a class (or
interface) in a class (or interface)
definition (e.g., in Java).  Thus, you can rely on the preceding
class keyword which will then be followed by an identifier.
   
A definition such as
keyword = '(\<(?:class|interface))([[:blank:]]+)([$[:alnum:]]+)'
will not produce a good final result, since the name of the class will
be highlighted as a keyword, which is not what you might have wanted:
for instance, the class name should be highlighted as a type.
   
Up to version 2.6, the only way to do this was to use state or environments (State/Environment Definitions) but this tended to be quite difficult to write.
Since version 2.7, you can specify a regular expression with marked
subexpressions and bind each of them to a specific language element (the
regular expression must be enclosed in `, see Ways of specifying regular expressions):
(elem1,...,elemn) = `(subexp1)(...)(subexpn)`
Now, with this syntax, we can accomplish our previous goal:
     (keyword,normal,type) =
       `(\<(?:class|interface))([[:blank:]]+)([$[:alnum:]]+)`
   This way, the class (or interface) will be highlighted as
a keyword, the separating blank characters are formatted as
normal, and the name of the class as a type.
   
Note that the number of element names must be equal to the number of subexpressions in the expression; furthermore, at least in the current version, the expression can contain only marked subexpressions (no character outside is allowed) and no nested subexpressions are allowed.
Thus, the following specifications are NOT correct:
     (keyword,symbol) = `(...)(...)(...)` # number of elements doesn't match
     (keyword,symbol) = `(...(...)...)(...)` # contains nested subexpressions
     (keyword,symbol) = `...(...)...(...)` # outside characters
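Binding each marked subexpression to an element name can be pictured in Python by pairing the names with the groups of a match. This is a sketch, not how source-highlight is implemented: ELEMENTS, PATTERN and highlight are hypothetical names, the POSIX classes are spelled out for ASCII, and \b stands in for \<.

```python
import re

ELEMENTS = ("keyword", "normal", "type")
PATTERN = re.compile(r"(\b(?:class|interface))([ \t]+)([$A-Za-z0-9]+)")

# The number of names must equal the number of marked subexpressions;
# (?:...) groups are not marked and therefore are not counted.
assert PATTERN.groups == len(ELEMENTS)


def highlight(line):
    """Pair each element name with the text matched by its subexpression."""
    m = PATTERN.search(line)
    if m is None:
        return []
    return list(zip(ELEMENTS, m.groups()))


print(highlight("public class Foo {"))
```

The pairing by position is also why the number of names and the number of marked subexpressions have to agree.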
   This mechanism permits expressing regular expressions for some situations
in a much more compact and probably more readable way.  For instance,
for highlighting ChangeLog parts (the optional * as a symbol, the
optional file name and the element specified in parentheses as a
file element, and the rest as normal) such as
       * src/Makefile.am (source_highlight_SOURCES): correctly include
       changelog_scanner.ll
     
       * this is a comment without a file name
   before version 2.7, we used to use these two language definitions:
     state symbol start '^(?:[[:blank:]]+)\*[[:blank:]]+' begin
       state file start '[^:]+\:' begin
         normal start '.'
       end
     end
     
     state normal start '^(?:[[:blank:]]+)' begin
       state file start '[^:]+\:' begin
         normal start '.'
       end
     end
   which can be hard to read after having written them. Now, we can write them more easily (see changelog.lang):
     (normal,symbol,normal,file)=
       `(^[[:blank:]]+)(\*)([[:blank:]]+)((?:[^:]+\:)?)`
     (normal,file)= `(^[[:blank:]]+)((?:[^:]+\:)?)`
   Since a language element definition using explicit subexpressions with names consists of more than one element, and thus of more than one formatting style, it cannot be used to start an environment (what would the default element be?); while, as seen above, they can be used to start a state.
These two features are useful when you want to define
a language by re-using an existing language definition
with some changes.  Typically you include another
language definition file and you redefine/substitute some
elements.
   
When you use redef you erase all the previous
definitions of that language element and replace them with the new one. 
The new language element definition will be placed exactly
at the point of the new definition. 
We use this feature, for instance, when we define the
sml language by re-using the caml one:
they differ only in the keywords [29].  In fact, the contents of
sml.lang can be summarized as follows:
     include "caml.lang"
     
     redef keyword = "abstraction|abstype|and|andalso..."
     
     redef type = "int|byte|boolean|char|long|float|double|short|void"
   Since the new language element definition appears in the
exact point of the redefinition, this means that
such a regular expression will be matched only if all
the previous ones (the ones of the included file) cannot
be matched.  This may lead to unwanted results in some
cases (not in the sml case though). 
In other words the following code
     keyword = "foo"
     keyword = "bar"
     type = "int"
     redef keyword = "myfoo"
   is equivalent to the following one
     type = "int"
     keyword = "myfoo"
   If this is not what you want, you can use subst,
which is similar to redef, except that it
replaces the first previous definition of that language
element at the exact point of that first definition
(all other possible definitions are simply erased). 
That is to say that the following code
That is to say that the following code
     keyword = "foo"
     keyword = "bar"
     type = "int"
     subst keyword = "myfoo"
   is equivalent to the following one
     keyword = "myfoo"
     type = "int"
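The difference between the two operations can be modeled in Python on an ordered list of (element, expression) pairs. This is only a model of the behavior described above: redef and subst here are hypothetical functions, and the case of subst on an element that was never defined is left out on purpose.

```python
def redef(rules, name, expr):
    """Erase every previous rule for `name` and append the new one."""
    return [r for r in rules if r[0] != name] + [(name, expr)]


def subst(rules, name, expr):
    """Put the new rule where the first rule for `name` was and erase
    the other rules for `name`."""
    out, placed = [], False
    for rule in rules:
        if rule[0] != name:
            out.append(rule)
        elif not placed:
            out.append((name, expr))
            placed = True
    return out


rules = [("keyword", "foo"), ("keyword", "bar"), ("type", "int")]
print(redef(rules, "keyword", "myfoo"))
print(subst(rules, "keyword", "myfoo"))
```

Since the order of the rules decides which one is tried first, the two results are not interchangeable even though they contain the same rules.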
   It is up to you to decide which one fits best your needs. 
We could use this feature to define javascript in terms
of java, e.g.:
     include "java.lang"
     
     subst keyword = "abstract|break|case|catch|class|if..."
   Here using redef would have led to the unwanted behavior that
if (exp) would have been highlighted as a function call, since
the function element definition would have come before (and thus
been matched before) the redefinition of if as a keyword. 
Another example is the language definition for C#, which reuses the one
for C/C++ (see Highlighting C/C++ and C#).
As hinted at the beginning of Language Definitions, source-highlight uses the definitions in the language definition file to internally create, on-the-fly, regular expressions that are used to highlight the tokens of an input file. Here we provide some internal details that are crucial to understand how to write language definition files correctly [30].
First of all, for each element definition a highlighting rule is created by source-highlight (even if several definitions correspond to the same language element); thus, each language definition file will correspond to a list of highlighting rules. For each line of the input file, source-highlight will try to match all these rules against the whole line (more formally, against the part of the line that has not been highlighted yet). It will not stop as soon as a highlighting rule matches, since there might be another rule that matches “better”. Now, everything basically reduces to the semantics of that better match.
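The strategy used by source-highlight is to select the first matching rule
with an empty prefix (or a prefix containing only spaces) or, if there is none,
the matching rule with the smallest prefix,
where the prefix of a matched rule is the part of the examined
string, before the match, that did not match [31].  Thus, for instance, if we try to match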
the simple regular expression = against the string
i = 10;
then the prefix is i , including the space.  Following the
terminology of regular expression, the remaining part that did not
match, i.e.,  10;, is the suffix.  When source-highlight
finds a matching rule, according to the above strategy, it formats the
matched part (and the prefix as normal), and then it starts again
searching for a matching rule on the suffix, until it processed the
whole line.
   
Let us explain this strategy a little bit further with an example. Consider the following language definition file:
     # an example for explaining the strategy of source-highlight
     type = "int"
     keyword = "null"
     symbol = "="
   and the following line to be highlighted:
int i = null
Then source-highlight performs these steps:
1. the first matching rule is type; since it has
an empty prefix, there's no need to look any further: it highlights
int as type; the remaining part to be processed is now
 i = null;
2. the first matching rule is keyword, with the prefix
 i = ; since the prefix is not empty (nor does it contain only
spaces), we inspect other rules;
3. the next matching rule is symbol, with prefix
 i , which is smaller than the one for keyword, and since there
are no other matching rules, the one for symbol is better, and we
highlight = as symbol; the remaining part to be processed is now
 null;
4. the first matching rule is keyword, and, since it has
a prefix with only spaces, we look no further, and we highlight
null as keyword.
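The steps above can be replayed with a small Python model of the strategy. It is a sketch under assumptions, not source-highlight's implementation: RULES and highlight are hypothetical names, \b stands in for the word asserts of double quoted definitions, and matches of zero length are skipped so that the loop always makes progress.

```python
import re

# The rules, in the order in which they appear in the definition file.
RULES = [
    ("type", re.compile(r"\bint\b")),
    ("keyword", re.compile(r"\bnull\b")),
    ("symbol", re.compile(r"=")),
]


def highlight(line, rules=RULES):
    tokens = []
    pos = 0
    while pos < len(line):
        best = None  # (prefix length, element name, match)
        for name, regex in rules:
            m = regex.search(line, pos)
            if m is None or m.end() == m.start():
                continue
            prefix = line[pos:m.start()]
            if prefix.strip() == "":
                # Empty or blank-only prefix: look no further.
                best = (len(prefix), name, m)
                break
            if best is None or len(prefix) < best[0]:
                # Otherwise remember the rule with the smallest prefix.
                best = (len(prefix), name, m)
        if best is None:
            tokens.append(("normal", line[pos:]))
            break
        _, name, m = best
        if m.start() > pos:
            tokens.append(("normal", line[pos:m.start()]))
        tokens.append((name, m.group()))
        pos = m.end()
    return tokens


print(highlight("int i = null"))
```

Calling the same function with a rule for / placed before a rule for //.* shows the ordering problem discussed earlier: the first rule matches with an empty prefix, so the comment rule is never selected.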
        We conclude this section by showing the following language definition, which summarizes what we said about the highlighting strategy:
     keyword = "if|class"
     
     type = 'int'
     
     comment delim "/*" "*/"
     
     # thus this won't catch "/* */ /" as a regexp,
     # since comment elem definition comes first
     regexp = '/.*/.*/'
     
     # this won't match if ( ) as a function,
     # since keyword elem definition comes first
     function = '([[:alpha:]]|_)[[:word:]]*[[:blank:]]*\(*[[:blank:]]*\)'
     
     # the following order is conceptually wrong,
     # since "//" won't be highlighted as a comment, but as two symbols
     symbol = "/"
     comment start "//"
   
Although we refer to Boost documentation for such syntax [32], we want to provide here some explanations of some forms of regular expressions that might be unknown but that are pretty useful in language definitions.
Typically, when you need to group sub-expressions with parentheses,  but
you don't want the parentheses to generate another marked
sub-expression,  you can use non-marking parentheses
(?:expression).  This is not necessary in the language definition
syntax: even if you use standard parentheses, source-highlight will
transform them into non-marking parentheses.
   
Source-highlight translates possible marked subexpressions, i.e.,
those enclosed in ( and ), into non-marked subexpressions
(i.e., those explained above).  Since version 2.7, if you specify the
expression inside ` the marked subexpressions are left as such
(see also Ways of specifying regular expressions).  This is useful
for backreferences and conditionals.
   
An escape character followed by a digit n, where n is in the range 1-9,
is a backreference that matches the same string that was matched by
sub-expression n.  For example the expression ^(a+)b*\1$ will
match the string aaabbaaa but not the string aaabba. 
Backreferences are useful to write compact language elements, such
as in the case of Perl's substitution modifiers; thus
regexp = `s([^[:alnum:][:blank:]]).*\1.*\1[ixsmogce]*`
will match all these forms
     s/foo/bar/g
     s|foo|bar|g
     s#foo#bar#g
     s@foo@bar@g
   A useful regular expression form is the forward lookahead assert, which comes in two forms, one for positive forward lookahead asserts, and one for negative lookahead asserts:
     (?=abc)
     (?!abc)
For instance, in the definition of a function (function.lang) we use the following regular expression:
([[:alpha:]]|_)[[:word:]]*(?=[[:blank:]]*\()
Thus after the name of a function we test, with the lookahead
(?=[[:blank:]]*\(), whether an open parenthesis ( (possibly preceded by blanks) can be matched.  If
it can be matched, however, we leave that part in the input, so that the
parenthesis will not be formatted the same way as a function name (see
also How source-highlight works to understand better this language
element definition).
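The fact that a lookahead assert tests the input without consuming it can be observed in Python. The sketch assumes Python's re as a stand-in for Boost regex, with [A-Za-z], \w and [ \t] replacing the POSIX classes; FUNCTION is just an illustrative name.

```python
import re

# A name, then a lookahead for optional blanks and an open parenthesis.
FUNCTION = re.compile(r"(?:[A-Za-z]|_)\w*(?=[ \t]*\()")

line = "result = compute (x, y)"
m = FUNCTION.search(line)
print(m.group())        # the text matched as a function name
print(line[m.end():])   # what is left in the input after the match
```

The blanks and the parenthesis are still in the remaining input, so other rules are free to highlight them.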
   
Please, be careful when using such regular expression forms: since part of the input is not actually removed you may end up always scanning the same input part (thus looping) if you do not write the regular expressions well. For instance, consider this language definition
     state foo = '(?=foo)' begin
       foo = '(?=foo)'
     end
   and the following input file:
     hello
     foo
     bar
   As soon as we match the word foo we leave it in the input and we
enter a state where we try to match the word foo still leaving it
in the input.  As you might have guessed this will make source-highlight
loop forever.  Probably one might have wanted to write this
language definition:
     state foo = '(?=foo)' begin
       foo = 'foo'
     end
   but a cut-and-paste error had its way ;-)
You can also use Lookbehind Asserts:
     (?<=pattern)
     (?<!pattern)
Another advanced regular expression mechanism is the one of conditional expressions
     (?(condition)yes-pattern|no-pattern)
     (?(condition)yes-pattern)
Condition may be either a forward lookahead assert, or the index [33] of a marked sub-expression (the condition becomes true if the sub-expression has been matched). For instance, the following expression [34], which we wrote on several lines to try to make it more readable
     (?:
       (\()
       |(\[)
       |(\{)
     )
     [[:alpha:]]*
     (?:
       (?(1)
         \)
         |(?:(?(2)
           \]
           |(?:\}
     )))))
   will match (foo), [foo] and {foo} but not
(foo], {foo] or {foo).
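Conditionals on the index of a marked subexpression are also available in Python's re module, so the example can be tried there. This is a sketch with Python standing in for Boost regex: [A-Za-z] replaces [[:alpha:]], and the expression is written in a slightly more compact form than the one above.

```python
import re

# Groups 1, 2 and 3 record which opening bracket was seen; the
# conditionals then require the corresponding closing bracket.
BRACKETS = re.compile(r"(?:(\()|(\[)|(\{))[A-Za-z]*(?(1)\)|(?(2)\]|\}))")

for text in ["(foo)", "[foo]", "{foo}", "(foo]", "{foo]", "{foo)"]:
    print(text, BRACKETS.fullmatch(text) is not None)
```

Only the first three inputs are accepted, since in the others the closing bracket does not correspond to the opening one.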
Since version 2.7, the source-highlight package comes with a small additional program, check-regexp, that permits testing regular expressions on the command line.
You simply pass as the first command line argument the regular expression and then the strings you want to try to match (actually, the program searches the string for the given regular expression, so it is not required to match the whole string). It is crucial, in order to avoid shell substitutions, to enclose both the expression and the strings in single quotes.
The program then prints some information about the (possibly successful)
matching.  The what[0] part represents the whole match, and
the what[i] part represents the i-th marked subexpression that
matched.  The program also prints possible prefix and suffix.
   
Here's an example of output of the program:
     check-regexp '(a+)(.*)\1' 'aabcdaa' 'babbbacc'
     
     searching      : aabcdaa
     for the regexp : (a+)(.*)\1
     what[0]: aabcdaa
       what[1]: aa
       length: 2
       what[2]: bcd
       length: 3
     total number of matches: 1
     
     searching      : babbbacc
     for the regexp : (a+)(.*)\1
     prefix: b
     what[0]: abbba
       what[1]: a
       length: 1
       what[2]: bbb
       length: 3
     suffix: cc
     total number of matches: 1
   And here's the example of matching parentheses we saw in Notes on regular expressions:
     check-regexp \
        '(?:(\()|(\[)|(\{))[[:alnum:]]*(?:(?(1)\)|(?:(?(2)\]|(?:\})))))' \
        '{ciao}' '(foo]' '[hithere]'
     
     searching      : {ciao}
     for the regexp : (?:(\()|(\[)|(\{))[[:alnum:]]*(?:(?(1)\)|(?:(?(2)\]|(?:\})))))
     what[0]: {ciao}
       what[3]: {
       length: 1
     total number of matches: 1
     
     searching      : (foo]
     for the regexp : (?:(\()|(\[)|(\{))[[:alnum:]]*(?:(?(1)\)|(?:(?(2)\]|(?:\})))))
     total number of matches: 0
     
     searching      : [hithere]
     for the regexp : (?:(\()|(\[)|(\{))[[:alnum:]]*(?:(?(1)\)|(?:(?(2)\]|(?:\})))))
     what[0]: [hithere]
       what[2]: [
       length: 1
     total number of matches: 1
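The notions of prefix, what[i] and suffix used in the outputs above can be reproduced with a few lines of Python. This is only a sketch of the idea, not the implementation of check-regexp: the function name check and the dictionary layout are assumptions made for this example, and only the first match is reported.

```python
import re


def check(regexp, text):
    """Search text for regexp and describe the first match, if any."""
    m = re.search(regexp, text)
    if m is None:
        return None
    return {
        "prefix": text[:m.start()],        # what comes before the match
        "what": [m.group(0), *m.groups()], # whole match, then subexpressions
        "suffix": text[m.end():],          # what comes after the match
    }


first = check(r"(a+)(.*)\1", "aabcdaa")
second = check(r"(a+)(.*)\1", "babbbacc")
print(first)
print(second)
```

As in the first example above, the string is searched for the expression, so a match does not have to cover the whole string.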
   
In order for language definitions to be really useful they must be
used in proper combination with formatting styles (see Output format style).  However, these different files might not be developed
by the same person, or someone may simply want to customize one of
them.  In order to define good output formatting style files you should
be aware of each language element defined by a language definition file. 
Instead of having to look inside the language definition file itself
(and recursively in each included file) you can use the command line
option --show-lang-elements [35], which
simply prints to the standard output all the language elements that
can be highlighted with a specific language definition file.
   
For instance, for cpp.lang you get:
     cbracket
     classname
     comment
     function
     keyword
     normal
     number
     preproc
     specialchar
     string
     symbol
     todo
     type
     url
     usertype
   while for log.lang you get:
     cbracket
     comment
     date
     function
     ip
     normal
     number
     port
     string
     symbol
     time
     twonumbers
     webmethod
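The two lists above can be compared mechanically, for instance to find the elements that a style written with only cpp.lang in mind (a hypothetical situation) would not cover for log files. A minimal sketch in Python, where the element names are copied from the lists above and the function name missing_from_style is an assumption of this example:

```python
# Element names as printed by --show-lang-elements in the lists above.
CPP_ELEMENTS = {
    "cbracket", "classname", "comment", "function", "keyword", "normal",
    "number", "preproc", "specialchar", "string", "symbol", "todo",
    "type", "url", "usertype",
}
LOG_ELEMENTS = {
    "cbracket", "comment", "date", "function", "ip", "normal", "number",
    "port", "string", "symbol", "time", "twonumbers", "webmethod",
}


def missing_from_style(lang_elements, styled_elements):
    """Return the language elements that the style does not mention."""
    return sorted(lang_elements - styled_elements)


print(missing_from_style(LOG_ELEMENTS, CPP_ELEMENTS))
```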
   
By mixing all these features you can unleash your imagination and define
highlighting for complex source languages such as Flex and Bison by
writing a few lines of code and reusing existing ones.  For instance, Flex
and Bison have their own syntax and let you write C/C++ code in
specific parts of the source language, e.g., the code between the
outermost brackets, in the following example, is C++ code, and should be
highlighted following C++ language definitions (apart from variables
that are prefixed with $):
     globaltags : options { if (...) { setTags( $1 ); } }
   This is easy to do (taken from flex.lang):
     state cbracket delim "{" "}" multiline nested begin
       variable = '\$.'
       include "cpp.lang"
     end
   Note that, since we used nested, we can be sure
that the C++ language definitions are no longer considered
once we have matched the last closing }.
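The effect of nested on the example line above can be illustrated by counting brackets. The following Python sketch is not how source-highlight is implemented; the function name nested_span is an assumption of this example, and it only shows why the state must be left at the last closing } and not at the first one.

```python
import re


def nested_span(text, start):
    """Return the index just past the '}' that closes the '{' at text[start]."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1            # entering a nested instance
        elif text[i] == "}":
            depth -= 1            # leaving one instance
            if depth == 0:
                return i + 1
    return None                   # brackets are not balanced


line = "globaltags : options { if (...) { setTags( $1 ); } }"
start = line.index("{")
end = nested_span(line, start)
print(line[start:end])                        # the whole C++ part
print(re.findall(r"\$.", line[start:end]))    # what the variable rule matches
```

Without the nesting count, the part would end at the first }, leaving the final } outside the C++ code.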
When writing a language definition file, it is quite useful to be able
to debug it (by using complex regular expressions one may experience
unwanted behaviors).  Since version 2.1 the command line option
--debug-lang is available.  When using this option, some
additional information is printed to the standard output.
   
Since version 2.5 this option also accepts a sub-specification (see
Invoking source-highlight).  When using dump (the default)
all the additional information explained below will be dumped without
interaction with the user.  When using interactive, for each
formatted string the program will stop waiting for a command from the
user.  In this very primordial version of interactive debug, the user
only has to press ENTER to make the program continue until
the next formatted string.  This way, the programmer will have the
chance to step through the highlighting of each part of the input file. 
Moreover, when debugging is enabled, no buffering will be performed by
the program, thus each formatted element will be immediately available
in the output.  For instance, you can use the command tail -f
to see the modifications on the output file on-the-fly.
   
When using this command line option the additional information produced has the following format:
     <.lang filename>:<line number>
     expression: <matched subexpression>
     formatting: <source file string to be formatted>
     entering: <next state's id>
     exiting state, level: <number of states>
   The lines starting with entering, exiting and
exitingall are related to entering a new state/environment and
exiting one and all states/environments (current state, if shown,
comes after entering and prints the same state's regular
expression but after the substitution of dynamic backreferences, see
Dynamic Backreferences).  The first line shows a link to the
.lang definition file and the line number of the definition that
matched; the line starting with expression shows the
sub-expression that matched and the line starting with formatting
shows the source file string that matched that expression.  If a
line starting with formatting is not preceded by a line with the
link to the sub-expression, it means that no particular regular
expression has matched, and thus the style normal will be used to
format that string.
   
Consider the following (simplified) Java source file:
     /*
       This is to demonstrate --debug-lang
       http://www.lorenzobettini.it
     */
     
     package hello;
     
     public class Hello {
             // just some greetings ;-)  /*
         int i = 10;
         System.out.println("Hello World!");
     }
     
   Now you can debug the java.lang file by using the
--debug-lang command line option.  And the output is as follows:
     c_comment.lang:24
     expression: "/\*"
     formatting "/*" as comment
     entering state: 23
     formatting "  This is to demonstrate --debug-lang" as default
     formatting "  " as default
     url.lang:3
     expression: "(?:(?:<?)[[:word:]]+://[[:word:]\./\-_]+(?:>?))"
     formatting "http://www.lorenzobettini.it" as url
     c_comment.lang:24
     expression: "\*/"
     formatting "*/" as comment
     exiting state, level: 1
     java.lang:1
     expression: "\<(?:import|package)\>"
     formatting "package" as preproc
     formatting " hello" as default
     symbols.lang:1
     expression: "(?:~|!|%|\^|\*|\(|\)|-|\+|=|\[|\]|\\|:|;|,|\.|/|\?|&|<|>|\|)"
     formatting ";" as symbol
     ... omissis ...
     c_comment.lang:13
     expression: "//"
     formatting "//" as comment
     entering state: 12
     formatting " just some greetings ;-)  /*" as default
     c_comment.lang:13
     expression: "\z"
     formatting "" as comment
     exiting state, level: 1
     ... omissis ...
   This should provide enough information to understand how the regular
expressions are used and how the states/environments are entered and
exited.  Please note that the sub-expressions that are shown may
differ from the original ones specified in the .lang file.  This
is due to the preprocessing that is performed by Source-highlight. 
Moreover, some sub-expressions are not defined at all in the
.lang file: for instance, this is the case for line wide
definitions, i.e., those that are defined with the keyword start
(see Line wide definitions).  The last lines above, showing
expression: "\z", mean that we matched the end of a line.
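Since the debug output is plain text, it can be post-processed with a small script, for instance to count how many strings were formatted with each element. The following Python sketch assumes that the formatting lines look exactly like those in the output shown above; the function name count_elements is made up for this example.

```python
import re
from collections import Counter

# Shape of a formatting line, as it appears in the sample output above.
FORMATTING = re.compile(r'^formatting "(.*)" as (\w+)$')


def count_elements(debug_output):
    """Count the formatted strings per element in --debug-lang output."""
    counts = Counter()
    for line in debug_output.splitlines():
        m = FORMATTING.match(line.strip())
        if m:
            counts[m.group(2)] += 1
    return counts


sample = r'''c_comment.lang:24
expression: "/\*"
formatting "/*" as comment
entering state: 23
formatting "  This is to demonstrate --debug-lang" as default
url.lang:3
formatting "http://www.lorenzobettini.it" as url
'''
print(count_elements(sample))
```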
   
Another useful feature in debugging is the option --show-regex
that shows, on the standard output, the regular expression automaton
that source-highlight creates.
   
For instance, consider this language definition (comment-show.lang):
     vardef TODO = '(TODO|FIXME)([:]?)'
     
     environment comment delim "/**" "*/" multiline begin
       type = '@[[:alpha:]]+'
       todo = $TODO
     end
     
     state cbracket delim "{" "}" escape "\\" multiline nested begin
       keyword = "if|then|else|endif"
     end
     
     string delim "<" ">"
     
     string2 delim "<<" ">>" multiline
   If you now execute the following command:
source-highlight --show-regex=comment-show.lang
you will get, on the standard output, the following output [36]:
     
               STATE 1 default: normal
            rule (comment) "/\*\*" (exit level: 0, next: 2)
              STATE 2 default: comment
                rule (comment) "\*/" (exit level: 1, next: 0)
                rule (type) "(?:\@[[:alpha:]]+)" (exit level: 0, next: 0)
                rule (todo) "(?:(?:TODO|FIXME)(?:[:]?))" (exit level: 0, next: 0)
            rule (cbracket) "\{" (exit level: 0, next: 3)
              STATE 3 default: normal
                rule (cbracket) "\}" (exit level: 1, next: 0)
                rule (cbracket) "\\." (exit level: 0, next: 0)
                rule (cbracket) "\{" (exit level: 0, next: 0, nested)
                rule (keyword) "\<(?:if|then|else|endif)\>" (exit level: 0, next: 0)
            rule (string) "<(?:[^<>])*>" (exit level: 0, next: 0)
            rule (string2) "<<" (exit level: 0, next: 4)
              STATE 4 default: string2
                rule (string2) ">>" (exit level: 1, next: 0)
          
   
   This shows the states and highlight rules of the regular expression automaton that source-highlight creates and will use to format an input source.
Each state is associated with a unique number in order to identify it; moreover, the default element of the state is shown (i.e., if none of the state's rules matches, then that part is highlighted with the default element style). For instance, in the initial state the default style is normal. Then for each state it shows the rules for that state. For each rule you can see the corresponding element of the rule, the regular expression for the rule and some other information, which we explain in the following.
We can see that if we match a /** (it is shown as a string with
escaped special characters, /\*\*) we enter a new state, in this
case the state 2 (next: 2).  This corresponds to the delimited
element defining a new environment (see State/Environment Definitions).  The fact that it is actually an environment and not a
state [37] can be seen by the fact
that the default element is the same as that of the environment itself.  If we
match a */, i.e., the end of the delimited element, we exit one
level (exit level: 1) meaning that we go back to state 1.  Then
we have the state for cbracket, which is not an environment; in
fact its default element is normal.  The second rule of this state,
\\., represents the escape string of the state definition. 
Since the delimited element is defined as nested, we have a third rule
\{ which has the nested information; thus, if we match it,
we simply enter a new instance of state 3 itself.
   
The string and string2 rules show the difference implied by the
multiline option: since source-highlight handles each line of input
separately, the first delimited definition can be handled with a single
regular expression while the multiline version cannot.
   
Note that the states/environments are indented so that it's easier to understand the outer and the inner states.
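To make the meaning of exit level and next more concrete, here is a toy interpreter for a part of the automaton shown above, written in Python. It is only a sketch under some assumptions: the table STATES and the function highlight are invented for this example, the POSIX classes are rewritten for Python's re module, and the rule whose match starts first is chosen, which is a simplification and not necessarily the strategy used by source-highlight.

```python
import re

# state id -> (default element, rules); each rule is
# (element, regex, exit level, next state), as in the --show-regex output.
STATES = {
    1: ("normal", [
        ("comment", r"/\*\*", 0, 2),
        ("string", r"<[^<>]*>", 0, 0),
    ]),
    2: ("comment", [
        ("comment", r"\*/", 1, 0),
        ("type", r"@[A-Za-z]+", 0, 0),
        ("todo", r"(?:TODO|FIXME):?", 0, 0),
    ]),
}


def highlight(text, stack=None):
    """Split text into (element, string) pairs by walking the automaton."""
    stack = stack if stack is not None else [1]
    out, pos = [], 0
    while pos < len(text):
        default, rules = STATES[stack[-1]]
        best = None
        for element, regex, exit_level, next_state in rules:
            m = re.compile(regex).search(text, pos)
            if m and (best is None or m.start() < best[0].start()):
                best = (m, element, exit_level, next_state)
        if best is None:                      # no rule matches: default element
            out.append((default, text[pos:]))
            break
        m, element, exit_level, next_state = best
        if m.start() > pos:                   # unmatched text before the match
            out.append((default, text[pos:m.start()]))
        out.append((element, m.group(0)))
        pos = m.end()
        if next_state:                        # next: N, enter state N
            stack.append(next_state)
        for _ in range(exit_level):           # exit level: N, leave N states
            stack.pop()
    return out


result = highlight("a /** @see TODO: x */ <b>")
for pair in result:
    print(pair)
```

Passing the same stack list to successive calls carries the current state from one line of input to the next, which is what a multiline environment requires.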
Let us now consider a variation of the previous example:
     vardef TODO = '(TODO|FIXME)([:]?)'
     
     environment comment delim "/**" "*/" multiline nested begin
       type = '@[[:alpha:]]+'
       todo = $TODO
     end
     
     regexp = `([^[:alnum:]]).*(\1)`
     
     string delim "<" ">"
     
     string2 delim "<<" ">>" multiline
     
     (paren,normal,paren) = `(\[)(.*)(\])`
   and let us see the output of --show-regex
     
               STATE 1 default: normal
            rule (comment) "/\*\*" (exit level: 0, next: 2)
              STATE 2 default: comment
                rule (comment) "\*/" (exit level: 1, next: 0)
                rule (comment) "/\*\*" (exit level: 0, next: 0, nested)
                rule (type) "(?:\@[[:alpha:]]+)" (exit level: 0, next: 0)
                rule (todo) "(?:(?:TODO|FIXME)(?:[:]?))" (exit level: 0, next: 0)
            rule (regexp) "(?:([^[:alnum:]]).*(\1))" (exit level: 0, next: 0)
            rule (string) "<(?:[^<>])*>" (exit level: 0, next: 0)
            rule (string2) "<<" (exit level: 0, next: 3)
              STATE 3 default: string2
                rule (string2) ">>" (exit level: 1, next: 0)
            rule (paren normal paren) "(\[)(.*)(\])" (exit level: 0, next: 0)
          
   
   Since in the rule regexp we used the ` regular expression
(see Ways of specifying regular expressions), the marked
subexpressions are not translated, so that backreferences work
correctly.
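The reason can be seen with any engine that supports backreferences. In the following Python sketch (an illustration only, with [^[:alnum:]] rewritten as [^A-Za-z0-9] for Python's re module) the first expression keeps its marked subexpression, while the second one turns it into a non-marking group, which is the kind of translation visible in the todo rule above; the backreference \1 then no longer refers to the delimiter.

```python
import re

# Marked subexpression kept: \1 refers to the opening delimiter.
DELIMITED = re.compile(r"([^A-Za-z0-9]).*(\1)")
m = DELIMITED.search("s/foo/bar")
print(m.group(0), m.groups())

# First group made non-marking: \1 now points at the group it is inside of,
# and Python refuses to compile the expression.
try:
    re.compile(r"(?:[^A-Za-z0-9]).*(\1)")
    broken_compiles = True
except re.error:
    broken_compiles = False
print(broken_compiles)
```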
   
The last rule uses explicit subexpressions with names (see Explicit subexpressions with names); although that expression is made up of different elements, the expression is matched as a whole.
Now we provide some examples of language definitions. In the previous sections we have already provided some code snippets, while here we provide complete examples of language definitions that are included in the source-highlight distribution itself.
In particular we will first show the language definition for the
language definition syntax itself (file langdef.lang).  This will
be used to highlight the examples of language definitions that we will
show in this section (the highlighting will not be visible if you are
viewing this manual with the info command).  Of course, this
example is highlighted itself.
     # this is the language definition for the
     # language definition syntax itself
     comment start "#"
     
     preproc = "include"
     
     string delim "\"" "\"" escape "\\" multiline
     regexp delim "'" "'" escape "\\" multiline
     regexp delim "`" "`" escape "\\" multiline
     
     keyword = "state|environment|begin|end|delim|escape|start",
               "multiline|nested|vardef|exitall|exit",
               "redef|subst|nonsensitive"
     
     symbol = "=|+|,|(|)"
     
     vardef ID = '[[:word:]]+'
     
     variable = '\$' + $ID
     
     variable = $ID
     
   The style that is used to highlight these examples in Texinfo is texinfo.style that is shown in Output format style. The language definition for the style syntax (file style.lang) is even simpler:
     # this is the language definition for the
     # style definition syntax
     comment start "//"
     
     string delim "\"" "\"" escape "\\"
     
     keyword = "bgcolor|purple|orange|brightorange|brightgreen|darkgreen",
               "green|darkred|red|brown|pink|yellow|cyan",
               "black|teal|gray|darkblue|blue",
               "normal|linenum",
               "noref|nf|f|u|i|b"
     keyword = 'bg\:'
     
     symbol = ",|;"
     
     variable = '[[:word:]]+'
     
   Note that this definition is pretty simple since the language definition syntax is simple. In the next examples we will see how to use more complex features to highlight more complex language syntaxes.
This is the language definition for C, included in the file c.lang:
     # definitions for C
     include "c_comment.lang"
     
     (keyword,normal,classname) =
       `(\<struct)([[:blank:]]+)([[:alnum:]_]+)`
     
     state preproc start '^[[:blank:]]*#(?:[[:blank:]]*include)' begin
             string delim "<" ">"
             string delim "\"" "\"" escape "\\"
             include "c_comment.lang"
     end
     
     preproc = '^[[:blank:]]*#([[:blank:]]*[[:word:]]*)'
     
     include "number.lang"
     
     include "c_string.lang"
     
     keyword = "__asm|__cdecl|__declspec|__export|__far16",
       "__fastcall|__fortran|__import",
       "__pascal|__rtti|__stdcall|_asm|_cdecl",
       "__except|_export|_far16|_fastcall",
       "__finally|_fortran|_import|_pascal|_stdcall|__thread|__try|asm|auto",
       "break|case|catch|cdecl|const|continue|default",
       "do|else|enum|extern|for|goto",
       "if|pascal",
       "register|return|sizeof|static",
       "struct|switch",
       "typedef|union",
       "volatile|while"
     
     type = "bool|char|double|float|int|long",
       "short|signed|unsigned|void|wchar_t"
     
     include "symbols.lang"
     
     cbracket = "{|}"
     
     include "function.lang"
     
     include "clike_vardeclaration.lang"
     
   Note that this makes use of lots of includes since these parts
are reused in other language definitions (e.g., Java has lots of parts
that are in common with C/C++ so we wrote these parts in separate
files).  In particular the comments definitions:
     # c_comment.lang
     
     vardef TODO = '(TODO|FIXME|BUG)([:]?)'
     
     # comments with documentation tags
     environment comment start "///" begin
       include "url.lang"
       include "html.lang"
       type = '@[[:alpha:]]+'
       todo = $TODO
     end
     
     comment start "//"
     
     # comments with documentation tags
     environment comment delim "/**" "*/" multiline begin
       include "url.lang"
       include "html.lang"
       type = '@[[:alpha:]]+'
       todo = $TODO
     end
     
     # standard comments
     environment comment delim "/*" "*/" multiline begin
       include "url.lang"
       todo = $TODO
     end
   Here we have the definitions for line-wide comments (//) and
for multi line comments where we also highlight URL addresses and
e-mail addresses (defined in the file url.lang not shown here). 
Moreover, for comments that are used in automatic documentation
generation tools (such as Doxygen or Javadoc), i.e., those that start
with /** or ///, we also highlight the complete HTML
syntax (defined in the file html.lang not shown here).
   
Going back to c.lang we see that we use subexpressions with names
(see Explicit subexpressions with names) for highlighting the
struct name (when preceded by struct, highlighted as a keyword).
   
For preprocessor directives #include we use a state definition
since in this case the file included with the <file> syntax must
be formatted as a string (and only in this context the <> must be
considered as string delimiters, anywhere else they are operators).  Since a state
erases definitions defined outside the state we must include
c_comment.lang again in order to highlight comments also in this
context [38].  Then we
have a definition of preproc that catches all the other
preprocessor directives.
   
The included file number.lang defines the regular expression that catches number constants (not shown here), then we include the file c_string.lang that defines strings (again shared by Java):
     vardef SPECIALCHAR = '\\.'
     
     environment string delim "\"" "\"" begin
       specialchar = $SPECIALCHAR
     end
     
     environment string delim "'" "'" begin
       specialchar = $SPECIALCHAR
     end
     
   Inside a string we want to highlight in a different way the special
characters (such as, e.g., \n, \t, etc.)  and in general
escaped characters, matched by the regular expression '\\.'.
   
The included file symbols.lang defines all the symbols (shared also by other languages):
     symbol = "~","!","%","^","*","(",")","-","+","=","[",
             "]","\\",":",";",",",".","/","?","&","<",">","\|"
   This has nothing interesting but the fact that it shows that
the characters \ and | have to be escaped.
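The need for escaping becomes evident when the same list of symbols is turned into a single alternation by hand. The following Python sketch is only an analogy: it assumes the symbols listed above and uses re.escape to protect every special character, whereas in a .lang file only \ and | have to be escaped explicitly, as shown above.

```python
import re

SYMBOLS = ["~", "!", "%", "^", "*", "(", ")", "-", "+", "=", "[",
           "]", "\\", ":", ";", ",", ".", "/", "?", "&", "<", ">", "|"]

# Join the symbols into one alternation, escaping each of them so that,
# e.g., | is matched literally and is not taken as the alternation operator.
SYMBOL_RE = re.compile("|".join(re.escape(s) for s in SYMBOLS))

print(SYMBOL_RE.findall("a|b\\c"))
```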
   
The included file function.lang defines the regular expression to match a function definition or invocation:
     vardef FUNCTION = '([[:alpha:]]|_)[[:word:]]*(?=[[:blank:]]*\()'
     function = $FUNCTION
   that shows an example of forward lookahead assert for the opening parenthesis (see Notes on regular expressions). As noted in File inclusion, it is crucial that this file is included after the keyword definition.
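The effect of the lookahead can be checked with the following Python sketch, where [[:alpha:]], [[:word:]] and [[:blank:]] are approximated by [A-Za-z], \w and [ \t] (an assumption of this example, since Python's re module has no POSIX classes):

```python
import re

# An identifier followed (after optional blanks) by an opening parenthesis;
# the parenthesis is only looked at, it is not part of the match.
FUNCTION = re.compile(r"(?:[A-Za-z]|_)\w*(?=[ \t]*\()")

call = 'printf ("Hello")'
m = FUNCTION.search(call)
print(m.group(0))          # the function name only
print(call[m.end():])      # blanks and parenthesis are left for other rules

# A keyword followed by a parenthesis matches as well, which is why the
# keyword definition must come before this one.
k = FUNCTION.search("if (argc > 0) { }")
print(k.group(0))
```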
Finally, c.lang includes the file clike_vardeclaration.lang:
     (usertype,usertype,normal) =
     `([[:alpha:]](?:[^[:punct:][:space:]]|[_])*)
     ((?:<.*>)?)
     (\s+(?=[*&]*[[:alpha:]][^[:punct:][:space:]]*\s*[[:punct:]\[\]]+))`
     
   This definition, using subexpressions with names (see Explicit subexpressions with names), tries [39] to match
user types (e.g., struct names) in function parameter and variable
declarations.  It basically tries to match a type identifier, then a
possible template specification [40] and then we have a complete lookahead assert
(see Notes on regular expressions) that tries to match the variable
identifier, possibly with & and * reference and pointer
specification, followed by an assignment = or a ;, more
generally a [:punct:] or [] (for array specifications). 
This should catch the user types in the correct contexts, as in the
following (where we intentionally highlighted usertype in
italics):
     Integer i = 10;
     Boolean b;
     String args[];
     const MyType args[];
     const My_Type args[];
     List<Integer> mylist;
     List<List<Integer> > mylist;
     myspace::InputStream iStream ;
     MyType *t;
     MyType **t;
     const MyType &t;
     if (argc > 0) { }
     
   Note that since for the third group we use a lookahead assert, what is
matched is not actually formatted but it is put back in the input stream
so that it can be formatted using other rules (e.g., symbol for
* and =).
   
Since, at least syntactically, C++ is an extension of C, the language definition for C++, included in the file cpp.lang, relies on c.lang [41]:
     # definitions for C++
     # most of it is shared with c.lang
     
     (keyword,normal,classname) =
       `(\<(?:class|struct|typename))([[:blank:]]+)([[:alnum:]_]+)`
     
     keyword = "class|const_cast|delete",
       "dynamic_cast|explicit|false|friend",
       "inline|mutable|namespace|new|operator|private|protected",
       "public|reinterpret_cast|static_cast",
       "template|this|throw|true",
       "try|typeid|typename",
       "using|virtual"
     
     include "c.lang"
     
   In particular, it extends the set of keywords.  Moreover, note that we
use subexpressions with names (see Explicit subexpressions with names) for highlighting the class (or struct) name (when preceded by
class, struct or typename, highlighted as a
keyword).  A similar rule was also present in c.lang, but it
concerned only struct.
   
Now that we wrote the language definition for C/C++, writing the one
for C# is straightforward, since we only need to add the keyword
using as a preprocessor element, and redefine (or better,
“substitute”, see Redefinitions and Substitutions) the keywords
and types:
     # definitions for C-sharp
     # by S. HEMMI, updated by L. Bettini.
     preproc = "using"
     
     number =
     '\<[+-]?((0x[[:xdigit:]]+)|(([[:digit:]]*\.)?
     [[:digit:]]+([eE][+-]?[[:digit:]]+)?))([FfDdMmUulL]+)?\>'
     
     include "cpp.lang"
     
     subst keyword = "abstract|event|new|struct",
      "as|explicit|null|switch",
      "base|extern|this",
      "false|operator|throw",
      "break|finally|out|true",
      "fixed|override|try",
      "case|params|typeof",
      "catch|for|private",
      "foreach|protected",
      "checked|goto|public|unchecked",
      "class|if|readonly|unsafe",
      "const|implicit|ref",
      "continue|in|return",
      "virtual",
      "default|interface|sealed|volatile",
      "delegate|internal",
      "do|is|sizeof|while",
      "lock|stackalloc",
      "else|static",
      "enum|namespace",
      "get|partial|set",
      "value|where|yield"
     
     subst type = "bool|byte|sbyte|char|decimal|double",
      "float|int|uint|long|ulong|object",
      "short|ushort|string|void"
     
   Now we want to highlight files that are generated by diff
(typically used to create patches).  This program can generate outputs
in three different formats (at least to the best of my knowledge).
   
With the option -u|--unified the differences among files
are shown in the same context, for instance (the examples of the
diff files shown here are manually modified so that they can
fit in the page width):
     diff -ruP source-highlight-2.1.1/source-highlight.spec ...
     --- source-highlight-2.1.1/source-highlight.spec ...
     +++ source-highlight-2.1.2/source-highlight.spec ...
     @@ -6,8 +6,8 @@
     
      Summary:   syntax highlighting for source documents
      Name:      source-highlight
     -Version:   2.1.1
     -Release:   2.1.1
     +Version:   2.1.2
     +Release:   2.1.2
      License:   GPL
      Group:     Utilities/Console
      Source:    ftp://ftp.gnu.org/gnu/source-highlight/%{name}-%{version}.tar.gz
     
   With the option -c|--context the differences are shown in
two different parts:
     diff -rc2P source-highlight-2.1.1/source-highlight.spec ...
     *** source-highlight-2.1.1/source-highlight.spec ...
     --- source-highlight-2.1.2/source-highlight.spec ...
     ***************
     *** 7,12 ****
       Summary:   syntax highlighting for source documents
       Name:      source-highlight
     ! Version:   2.1.1
     ! Release:   2.1.1
       License:   GPL
       Group:     Utilities/Console
     --- 7,12 ----
       Summary:   syntax highlighting for source documents
       Name:      source-highlight
     ! Version:   2.1.2
     ! Release:   2.1.2
       License:   GPL
       Group:     Utilities/Console
     diff -rc2P source-highlight-2.1.1/src/latex.outlang ...
     *** source-highlight-2.1.1/src/latex.outlang ...
     --- source-highlight-2.1.2/src/latex.outlang ...
     ***************
     *** 35,37 ****
     --- 35,38 ----
       "--" "-\\/-"
       "---" "-\\/-\\/-"
     + "\"" "\"{}" # avoids problems with some inputenc
       end
     
   Without options it generates only the essential difference information without any additional context lines:
     diff -rP source-highlight-2.1.1/source-highlight.spec ...
     9,10c9,10
     < Version:   2.1.1
     < Release:   2.1.1
     ---
     > Version:   2.1.2
     > Release:   2.1.2
   Summarizing, we would like to be able to handle all these three
different syntaxes; note that the first format and the second format
have something conflicting: the first one uses the --- to
indicate the new version of a file while the second format uses it to
indicate the old version of a file.  Since we want to highlight
differently the old parts and the new parts (this is not visible in the
Texinfo highlighting due to the lack of enhanced formatting features,
but it is visible for instance in HTML output where we use two
different colors), this behavior adds some difficulties.  Of course, we
could define three different language definitions, one for each diff
output format.  However, we prefer to handle them all in the same
file!
   
This is the language definition for diff files:
     # language definition for files created with 'diff'
     
     # diff created with -u option
     state oldfile = '(?=^[-]{3})' begin
       oldfile start '^[-]{3}'
       oldfile start '^[-]'
       newfile start '^[+]'
       difflines start '^@@'
     end
     
     # diff created with -c option
     state oldfile = '(?=^[*]{3})' begin
       environment oldfile = '^[*]{3}[[:blank:]]+[[:digit:]]' begin
         normal start '^[[:space:]]'
         newfile = '(?=^[-]{3})' exit
       end
       oldfile start '^[*]{3}'
     
       environment newfile = '^[-]{3}[[:blank:]]+[[:digit:]]' begin
         normal start '^[[:space:]]'
         newfile = '(?=^[*]{3})' exit
         normal start '^diff' exit
       end
       newfile start '^[-]{3}'
     end
     
     # otherwise, created without options
     state difflines = '(?=^[[:digit:]])' begin
       difflines start '^[[:digit:]]'
       oldfile start '^[<]'
       newfile start '^[>]'
     end
     
   Since we can safely assume that when we process a diff file it contains
only information created with the same diff command line switch, we
define three different states that correspond to the three diff output
formats.  Note that these states are entered with a simple definition;
as noted in State/Environment Definitions, this means that no
automatic exit means are provided, and since no explicit exit condition
is specified, once one of these states is entered it will
never be exited.  This is consistent with our goal.  Of course, the
expression that makes us enter a state must be defined correctly, and in
particular we first search for an initial --- sequence since this
is used as the first difference specification by the -u|--unified
option, so this is a distinguishing feature to be used to
infer which diff format file we are processing.
   
Another interesting thing is that the expressions that make us enter the three states are forward lookahead asserts (see Notes on regular expressions), since we only want to see which file format we are processing, without consuming the matched text. Once we have entered the right state we can define the regular expressions for the elements of the specific diff file format.
For the files created with the option -c|--context we define two
inner environments, one for the new file part and one for the old file
part (these are delimited by a --- or *** and line number
information).  Note that these are environments, so anything that is
not matched by any expression is formatted according to the style of the
element that defines the environment.  Thus, we provide an expression
for text that must be formatted as normal.  For diff files this
corresponds to a line that starts with a space or with diff (take
a look at the examples above).  In particular the latter case can take
place only during the new file part.  In both environments we must
define the exit conditions.  In both cases these correspond to the
beginning of the complementary part; also in this case we use forward
lookahead assertions, since we use them only to exit the environment.  The
outer definitions for oldfile and newfile are used to
match the lines with source file information.
   
The third state, corresponding to the normal diff output format, should be straightforward by now.
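The way the three states are selected can be summarized with a small Python sketch. It is not the code used by source-highlight: the names ENTRY_RULES and detect_diff_format are assumptions of this example, and the three expressions are simplified versions of those that enter the states in the definition above.

```python
import re

# One entry expression per diff output format, tried on each line in turn.
ENTRY_RULES = [
    ("unified", re.compile(r"^-{3}")),   # --- comes first with -u
    ("context", re.compile(r"^\*{3}")),  # *** comes first with -c
    ("normal", re.compile(r"^[0-9]")),   # e.g. 9,10c9,10 without options
]


def detect_diff_format(text):
    """Return the format chosen by the first line that matches a rule."""
    for line in text.splitlines():
        for name, regex in ENTRY_RULES:
            if regex.search(line):
                return name
    return None


unified = "diff -ruP a/x.spec b/x.spec\n--- a/x.spec\n+++ b/x.spec\n@@ -6,8 +6,8 @@\n"
context = "diff -rc2P a/x.spec b/x.spec\n*** a/x.spec\n--- b/x.spec\n***************\n"
normal = "diff -rP a/x.spec b/x.spec\n9,10c9,10\n< Version:   2.1.1\n---\n> Version:   2.1.2\n"
print(detect_diff_format(unified), detect_diff_format(context), detect_diff_format(normal))
```

Once a format has been chosen it is never reconsidered, which mirrors the fact that the states of the language definition are never exited; this is why the later --- lines of the context and normal formats are not mistaken for the unified format.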
Source-highlight, by means of regular expressions, can only perform lexical analysis of the input source. In particular, it is based on the assumption that the input source is syntactically correct with respect to the input language. However, by using the language definition syntax and by writing the right regular expressions it is possible to simulate some sort of semantic analysis of the input source.
For instance, consider the following C (or C++) source file:
     // test special #if 0 treatment
     
     int main() {
     #if 0 // equivalent to a comment
       int i = 10;
       printf("this should never be executed\n");
       return 1;
     #else
       printf("Hello world!\n");
       return 0;
     #endif
     
       printf("never reach here!\n");
     }
     
   It is easy to verify that the code between #if 0 and
#else will never be executed (indeed it will not even
be compiled).  Thus, we might want to format it as a comment.
   
We then write another language definition file, based on the file cpp.lang:
     environment comment start '^[[:blank:]]*#if[[:blank:]]+0' begin
       comment start '^[[:blank:]]*#(else|endif)' exit
     end
     
     include "cpp.lang"
     
   We intentionally included an error in this first version:
we used the start element to start the environment,
but such an element has the scope of a single line; thus,
it does not have the desired behavior:
     // test special #if 0 treatment
     
     int main() {
     #if 0 // equivalent to a comment
       int i = 10;
       printf("this should never be executed\n");
       return 1;
     #else
       printf("Hello world!\n");
       return 0;
     #endif
     
       printf("never reach here!\n");
     }
     
   A better solution is the following one:
     environment comment = '^[[:blank:]]*#[[:blank:]]*if[[:blank:]]+0' begin
       comment start '^[[:blank:]]*#[[:blank:]]*(else|endif)' exit
     end
     
     include "cpp.lang"
     
   Here we enter the comment environment by not using a delimited
element, but simply the regular expression to match #if 0. 
Then we exit the environment either when we match an #else or an
#endif.  This seems to work:
     // test special #if 0 treatment
     
     int main() {
     #if 0 // equivalent to a comment
       int i = 10;
       printf("this should never be executed\n");
       return 1;
     #else
       printf("Hello world!\n");
       return 0;
     #endif
     
       printf("never reach here!\n");
     }
     
   However, it does not work if we consider nested #if...#else; for
instance consider the following code, formatted with the previous
language definition:
     // test special #if 0 treatment
     
     int main() {
     #if 0 // equivalent to a comment
       int i = 10;
       printf("this should never be executed\n");
     #  ifdef FOO
       printf("foo\n");
     #     ifndef BAR
       printf("no bar\n");
     #     else
     #     endif
     #  else
       printf("no foo\n");
     #  endif // FOO
       return 1;
     #else
       printf("Hello world!\n");
       return 0;
     #endif
     
       printf("never reach here!\n");
     }
     
   The problem is that the previous language definition does not consider
nested #if and thus, the first time it matches a #else or
an #endif it exits the comment environment.
   
We must then take into account possible nested occurrences.  This can be
done by using a delimited element with the nested option
(Delimited definitions):
     # treat the preprocess statement
     #  #if 0
     #    ...
     #  #else
     # as a comment
     
     environment comment = '^[[:blank:]]*#[[:blank:]]*if[[:blank:]]+0' begin
       comment start '^[[:blank:]]*#[[:blank:]]*else' exit
       comment delim '^[[:blank:]]*#[[:blank:]]*if'
                     '^[[:blank:]]*#[[:blank:]]*endif' multiline nested
     
     end
     
     include "cpp.lang"
     
     
   This time the right block of code is correctly formatted as a comment:
     // test special #if 0 treatment
     
     int main() {
     #if 0 // equivalent to a comment
       int i = 10;
       printf("this should never be executed\n");
     #  ifdef FOO
       printf("foo\n");
     #     ifndef BAR
       printf("no bar\n");
     #     else
     #     endif
     #  else
       printf("no foo\n");
     #  endif // FOO
       return 1;
     #else
       printf("Hello world!\n");
       return 0;
     #endif
     
       printf("never reach here!\n");
     }
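
The effect of the nested option can be pictured as a depth counter. The following Python sketch is not how Source-highlight is implemented; it only mimics the language definition above on a list of lines. The function name comment_line_indexes is made up for this illustration, [ \t] stands in for [[:blank:]], and, as an extra assumption, the sketch also leaves the comment on an #endif found at depth zero.

```python
import re

# Python counterparts of the regular expressions used in the definition.
IF0 = re.compile(r'^[ \t]*#[ \t]*if[ \t]+0')
ANY_IF = re.compile(r'^[ \t]*#[ \t]*if')
ELSE = re.compile(r'^[ \t]*#[ \t]*else')
ENDIF = re.compile(r'^[ \t]*#[ \t]*endif')


def comment_line_indexes(lines):
    """Return the indexes of the lines that would be formatted as comment."""
    inside = False   # are we inside the "#if 0" environment?
    depth = 0        # nesting depth of #if...#endif inside the environment
    indexes = []
    for i, line in enumerate(lines):
        if not inside:
            if IF0.match(line):
                inside, depth = True, 0
                indexes.append(i)
            continue
        indexes.append(i)
        if ANY_IF.match(line):
            depth += 1                  # a nested #if, #ifdef or #ifndef
        elif ENDIF.match(line):
            if depth == 0:
                inside = False          # assumption: also exit here
            else:
                depth -= 1              # closes a nested #if
        elif ELSE.match(line) and depth == 0:
            inside = False              # only a top-level #else exits
    return indexes
```

An #else or #endif met while depth is greater than zero belongs to a nested block and therefore does not close the comment, which is the behavior that the version without nested lacks.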
     
   Note that it is crucial to exit the environment even when we match an
#else (not only an #endif), since, this way, we can match
another #if 0 again; consider, for instance, the following
code:
     // test special #if 0 treatment
     
     int main() {
     #if 0 // equivalent to a comment
       int i = 10;
       printf("this should never be executed\n");
       return 1;
     #else
       printf("Hello world!\n");
     #   if 0 // another one
       return 1;
     #   else
       return 0;
     #   endif
     #endif
     
       printf("never reach here!\n");
     }
     
   
Since version 2.1 source-highlight uses a specific syntax to specify output formats (e.g., how to format in HTML, LaTeX, etc.). Before version 2.1, in order to add a new output format, many C++ classes had to be written. This had the drawback that a new output format could not be added “dynamically”: you had to recompile the whole source-highlight program.
Instead, now, an output format is specified in a file, loaded dynamically, through a (hopefully) simple syntax. Then, these definitions are used internally to create, on-the-fly, text formatters.
Here, we see this syntax in detail, by relying on many examples. This allows a user to easily modify an existing output format definition or create a new one. These files typically have the extension .outlang.
Each definition basically associates a text style (such as, e.g., bold,
italics, colors, etc.)  to the representation of that style into the
output format (such as, e.g., <b>$text</b> in HTML).  The
representation is given in " and you can use the classic escape
character \ to use the " inside the definition.  If you
want to specify the ASCII code for a character you can do so by
specifying the numeric code in hexadecimal notation preceded by
\x, for an example, see Style template.
   
If no definition is given for a specific style, e.g., bold, then when that style is requested during formatting, the text will be formatted as it is, i.e., the style without the definition is simply ignored.
Comments can be given by using #; the rest of the line is
considered as a comment.
   
Files can be included in the same way as for language definitions, File inclusion.
In any case, if a definition for a style is given more than once, the last definition replaces all the others.
With the line:
extension "<file extension>"
you define the default file extension (without the .) used to
generate files formatted according to this output format.  This is used
when no output file name is specified; if the file extension is not
defined in the .outlang file and no output file name
is specified, an error will occur.
   
For instance, this is used in html_common.outlang:
extension "html"
These are the text styles that one can define:
     bold
     italics
     underline
     notfixed
     fixed
   These, of course, correspond to the ones used to specify the output format style, Output format style.
These definitions, for instance, are from the HTML format definition:
     bold "<b>$text</b>"
     italics "<i>$text</i>"
     underline "<u>$text</u>"
   Inside a definition you use the special variable $text to specify
where the actual text to be formatted has to be inserted.  For instance,
the definition of bold above says that if you need to format the
keyword class in bold in HTML, the following text will be
generated: <b>class</b>.  This variable is also used when mixing
more than one style recursively; in particular, if you want to format in
bold and italics (i.e., first bold and then italics, or, in other words,
the sequence i, b is used in the output format style file, see
Output format style), then first the text
class is substituted for $text into <b>$text</b>
and then the text <b>class</b> will be substituted for
$text into <i>$text</i>, thus obtaining
<i><b>class</b></i>.
The definition for using colors during formatting requires
the definition for the color style
color "..."
and for the bgcolor style:
bgcolor "..."
This definition concerns only the background color for a specific
highlighted element, i.e., the color specified in the style file with
the prefix bg: (see Output format style) or the property
background-color specified in a CSS file passed to
--style-css-file (see Output format style using CSS). 
Thus it should not be confused with the background color of the entire
output (i.e., the one specified using bgcolor in a style file or
the property background-color of the body selector in a
CSS).  The background color for the entire document is explained in
Document template.
   
Note that the background color might not be available for all output formats. For instance, for HTML we only have:
color "<font color=\"$style\">$text</font>"
while for XHTML we have:
     color "<span style=\"color: $style\">$text</span>"
     bgcolor "<span style=\"background-color: $style\">$text</span>"
   Apart from the variable $text that we already saw, we
have also the variable $style, that will be replaced
with the actual color.
   
Source-highlight recognizes a number of color constants, see Output format style.
You then must associate a color constant to the color definition in the
output format, through the colormap definition:
     colormap
     "color constant" "color representation"
     "color constant" "color representation"
     ...
     default "default color representation"
     end
   The default row (note the absence of ") defines the
color to be used in case a color constant is used during formatting, but
it is not defined in the output format.
   
For instance, for HTML we have:
     colormap
     "green" "#33CC00"
     "red" "#FF0000"
     "darkred" "#990000"
     "blue" "#0000FF"
     "brown" "#9A1900"
     "pink" "#CC33CC"
     "yellow" "#FFCC00"
     "cyan" "#66FFFF"
     "purple" "#993399"
     "orange" "#FF6600"
     "brightorange" "#FF9900"
     "brightgreen" "#33FF33"
     "darkgreen" "#009900"
     "black" "#000000"
     "teal" "#008080"
     "gray" "#808080"
     "darkblue" "#000080"
     default "#000000"
     end
   If your output format does not handle colors you can simply omit the
definitions of color and colormap, and Source-highlight
will simply ignore colors.
   
The color is applied after applying the other styles, e.g., bold, italics, etc.
Thus, by continuing the example of the previous section, suppose you defined the following output style for keywords:
keyword blue i, b;
then the class text will be substituted for the $text variable and
the value #0000FF for $style inside the color definition
<font color="$style">$text</font>, obtaining <font
color="#0000FF">class</font>, which will then be substituted for
$text in <b>$text</b>, and so on for italics, finally
obtaining
   
<i><b><font color="#0000FF">class</font></b></i>.
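
The order of these substitutions can be reproduced with a few lines of Python. This is only a sketch of the composition described above, not Source-highlight's own code: string.Template happens to use the same $name notation, the dictionaries copy a few of the HTML definitions shown in this section, and format_element is a name invented for the illustration.

```python
from string import Template

# A few definitions copied from the HTML examples in this section.
DEFINITIONS = {
    "b": "<b>$text</b>",
    "i": "<i>$text</i>",
    "u": "<u>$text</u>",
    "color": '<font color="$style">$text</font>',
}
COLORMAP = {"blue": "#0000FF", "red": "#FF0000"}
DEFAULT_COLOR = "#000000"   # plays the role of the "default" colormap row


def format_element(text, color=None, styles=()):
    """Mimic a style line such as:  keyword blue i, b;"""
    # The color is applied first, so it ends up innermost.
    if color is not None and "color" in DEFINITIONS:
        value = COLORMAP.get(color, DEFAULT_COLOR)
        text = Template(DEFINITIONS["color"]).substitute(style=value, text=text)
    # The sequence i, b means: first bold, then italics.
    for name in reversed(styles):
        definition = DEFINITIONS.get(name)
        if definition is not None:   # a style without a definition is ignored
            text = Template(definition).substitute(text=text)
    return text


print(format_element("class", color="blue", styles=["i", "b"]))
# prints: <i><b><font color="#0000FF">class</font></b></i>
```

A color constant missing from the colormap falls back to the default value, and a style with no definition leaves the text unchanged.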
When using the command line option --line-number-ref
(Invoking source-highlight) an anchor is generated in the output
file for each line number.  The style of the anchor is defined by the
definition anchor.  If this is not defined, the option
--line-number-ref has no effect.  The $linenum variable will
be replaced with the line number, and the $text variable
with the actual text.
   
For instance, for HTML we have
anchor "<a name=\"$linenum\">$text</a>"
Since version 2.2 source-highlight can also generate references to
several elements (e.g., variables, class definitions, etc.),
Generating References.  Also in this case the definition
anchor is used; furthermore, the definition of reference
is required.   In the definition of anchor and reference,
apart from the variable $linenum, we also have the variables
$infile (the name of the original input file) and
$infilename (the name of the original input file without the
path) and in the definition of reference we also have the
variable $outfile (the name of the file where the anchor is). 
One can decide how to define an anchor and a reference by using these
variables.  For instance, for HTML we have
reference "<a href=\"$outfile#$linenum\">$text</a>"
Note, that in this case we use the $outfile since we actually
generate a link to another (or possibly the same) output file.
   
On the contrary, for LaTeX, since we do not generate a “clickable”
reference, we refer to the original input file (we use both
$infilename and $linenum in both definitions of anchor
and reference):
     anchor "\label{$infilename:$linenum}$text"
     reference "{\hfill $text $\rightarrow$ $infile:$linenum, \
                page~\pageref{$infilename:$linenum}}"
   In particular, we use $infilename for generating the
\label and not $infile because the path symbol would
“disturb” LaTeX (while we use the complete file path in the textual
information of the reference).
   
This will generate a right aligned reference.  Note that it is assumed
that when generating references in LaTeX one uses
--gen-references=postline or --gen-references=postdoc and
not --gen-references=inline (Generating References), since
it makes no sense to generate an inline reference (or at least I would
not know how to generate a nice looking one :-).
   
Furthermore, for Texinfo:
     anchor "@anchor{$infilename:$linenum}$text"
     reference "@flushright
     @xref{$infilename:$linenum,$text,$text $infile:$linenum}.
     @end flushright"
   Note that using both $infilename (and not $infile for
the same reasons) and $linenum also in the definition of
anchor somehow ensures that there are no duplicate anchors; this
is done for LaTeX and Texinfo but not for HTML because it is assumed
that the generated .tex and .texinfo file is included
directly in a master file, as it is done in this manual (while, for
instance, it is assumed that a separate HTML file is generated for each
source and kept separate).  If this is not your case you can change the
definitions of anchor and reference as you see fit.  Some
examples of outputs with references in Texinfo are shown in
Examples.
   
Indeed, one can use three more definitions for reference that
correspond to the three arguments that can be passed to the
--gen-references command line option (Generating References): inline_reference, postline_reference and
postdoc_reference.  If one of these is not defined, then the
definition of reference is used instead.  Having the possibility of
specifying different definitions is useful for instance in the case of
HTML: the same style for an inline reference is pretty ugly when used
also for a postline or postdoc reference:
     postline_reference "<a href=\"$outfile#$linenum\">$text -> $infile:$linenum</a>"
     postdoc_reference "<a href=\"$outfile#$linenum\">$text -> $infile:$linenum</a>"
     reference "<a href=\"$outfile#$linenum\">$text</a>"
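
The fallback to the plain reference definition can be sketched in Python as follows. This is an illustration under assumptions, not the program's implementation: the dictionary holds the three HTML definitions above (without the \" escapes needed in an .outlang file), and render_reference is a hypothetical helper.

```python
from string import Template

DEFINITIONS = {
    "reference": '<a href="$outfile#$linenum">$text</a>',
    "postline_reference":
        '<a href="$outfile#$linenum">$text -> $infile:$linenum</a>',
    "postdoc_reference":
        '<a href="$outfile#$linenum">$text -> $infile:$linenum</a>',
}


def render_reference(kind, **variables):
    """kind is "inline", "postline" or "postdoc"."""
    # Use the specific definition if present, the generic one otherwise.
    definition = DEFINITIONS.get(kind + "_reference", DEFINITIONS["reference"])
    return Template(definition).substitute(variables)


values = dict(outfile="test.html", infile="test.h", linenum=46,
              text="startTextGeneration")
print(render_reference("inline", **values))    # no inline_reference: fallback
print(render_reference("postline", **values))
```

Since no inline_reference is defined here, the inline case uses the reference definition, while postline and postdoc use the more verbose form.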
   
If the output format you are defining does not have a specific style
for bold, italics, ... and for colors you can simply use the definition
onestyle, where you can use both $style and $text. 
This will be used for any style (indeed any other definition such as
bold, italics, color will be ignored).  Indeed, in this case, it is
assumed that the style of each source element is defined in a file with
its own syntax, i.e., not with a syntax defined by Source-highlight. 
(This is the case, for instance, of HTML using CSS style sheets.) 
Moreover, since the output format style is not used, during formatting
the variable $style will be replaced with the name of the element
to highlight (e.g., keyword, comment, etc.).
   
For instance, for HTML CSS, we simply have:
onestyle "<span class=\"$style\">$text</span>"
In fact, HTML CSS relies on style definitions provided in a separate
file (the .css file indeed).  Thus, when formatting a
keyword, e.g., abstract, we will obtain:
<span class="keyword">abstract</span>
Of course, the style for keyword must be defined in the
.css file.
Some output formats are based on a single template into which the other styles are composed; during composition the styles can be separated with a specific separator:
     styletemplate "..."
     styleseparator "..."
   This is used, for instance, for the ANSI color escape sequence output format (esc.outlang):
     styletemplate "\x1b[$stylem$text\x1b[m"
     styleseparator ";"
     
     bold "01$style"
     underline "04$style"
     italics "$style"
     color "$style"
   Note that, since more than one style can be mixed into the style
template, bold, underline, ... explicitly use the variable
$style.
This feature allows you to generate a string as the prefix of each generated line that corresponds to an input line (i.e., this prefix is not generated for other generated output elements, e.g., the lines in the header, footer, etc.).
We use this feature in the LaTeX output (LaTeX output):
     lineprefix "\mbox{}"
   This way each line in the LaTeX output is prefixed with
\mbox{}.
   
Another interesting example that uses lineprefix is the javadoc
output, see Generating HTML output.
Some character sequences that are in the source file may have a special meaning in an output format, so they need some preprocessing (e.g., escaping them). You can specify the translation table with:
     translations
     "original sequence" "transformed sequence"
     'regex' "transformed sequence"
     ...
     end
   The difference between "original sequence" and
'regex' is that with the former
you specify a character sequence that will be matched literally, apart
from special characters such as \ (which, if needed to be
inserted, must be escaped), \n (new line) and \t (tab
character).  Instead, with the latter, you can specify a regular
expression (this is basically the same difference between " and
' in language definitions, see Simple definitions).
   
For instance, for HTML, we have the following translation table:
     translations
     "&" "&amp;"
     "<" "&lt;"
     ">" "&gt;"
     end
   For LaTeX, the translation table is a little bit bigger; here we
show only a little part, that shows how to escape special characters
(such as \), to translate a new line character and tab
character:
     translations
     "<" "$<$"
     ">" "$>$"
     "&" "\\&"
     "\\" "\\textbackslash{}"
     "\n" " \\\\\n"
     " " "\\ "
     "\t" "\\ \\ \\ \\ \\ \\ \\ \\ "
     end
   Note that, since a new line character must be translated in LaTeX with
\\, we have to escape two \ (i.e., \\\\) and then
we want to actually insert a new line in the output file (\n).
   
For HTML with a non-fixed font by default, html_notfixed.outlang
(see HTML and XHTML output), we need to translate each two-space sequence
(i.e., two adjacent spaces, since in HTML several adjacent spaces are
rendered as only one space, while we want them as they are), and we also
need to translate a space starting a new line in the source (thus we
use the regular expression ^ , enclosed in '); thus we
have:
     translations
     "\n" "<br>\n"
     "  " "&nbsp; "
     '^ ' "&nbsp;" # a space at the beginning of a line
     "\t" "&nbsp; &nbsp; &nbsp; &nbsp; "
     end
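
To see how a translation table with both literal and regular expression entries can be applied in a single pass (so that, e.g., the & of an already produced entity is not translated again), consider this Python sketch. It is an assumption about one possible strategy, not a description of Source-highlight's internals; build_translator and the triple layout (sequence, replacement, is_regex) are invented for the example, and regex entries are assumed to contain no capturing groups of their own.

```python
import re


def build_translator(table):
    """table: list of (sequence, replacement, is_regex) triples."""
    parts = []
    for sequence, _replacement, is_regex in table:
        # Literal sequences are escaped; regex entries are used as they are.
        parts.append("(" + (sequence if is_regex else re.escape(sequence)) + ")")
    # MULTILINE lets ^ match at the beginning of every line.
    combined = re.compile("|".join(parts), re.MULTILINE)

    def translate(text):
        # m.lastindex is the number of the alternative that matched.
        return combined.sub(lambda m: table[m.lastindex - 1][1], text)

    return translate


html = build_translator([
    ("&", "&amp;", False),
    ("<", "&lt;", False),
    (">", "&gt;", False),
])
print(html("a < b && c > d"))   # prints: a &lt; b &amp;&amp; c &gt; d

notfixed = build_translator([
    ("\n", "<br>\n", False),
    ("  ", "&nbsp; ", False),
    ("^ ", "&nbsp;", True),     # a space at the beginning of a line
])
```

Earlier entries take precedence over later ones when two of them could match at the same position.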
   
You can define the beginning and the end of an output file, with
     doctemplate
     "...beginning..."
     "...end..."
     end
     
     nodoctemplate
     "...beginning..."
     "...end..."
     end
   The first one is used when the --doc command line option is
specified, while the second one is used in the other case.
   
For instance, for HTML we have
     nodoctemplate
     "<!-- Generator: $additional -->
     $header<pre><tt>"
     "</tt></pre>$footer
     "
     end
   Note that in the end part there is an explicit new line.
In the definition of the doctemplate and nodoctemplate the
following variables can be used and will be replaced during the output
generation:
     
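$title
     the title specified with the --title command line option;
$header
     the header specified with --header;
$footer
     the footer specified with --footer;
$css
     the style sheet specified with --css;
$additional
     additional information (in the examples below it is used to
     identify the generator);
$docbgcolor
     the background color of the document, i.e., the bgcolor of the .style
     file (see Output format style) or the one in the body selector of
     the CSS file passed with --style-css-file (see Output format style using CSS).
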
For instance, for an HTML document with css, (file htmlcss.outlang) we have:
     doctemplate
     "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0//EN\"
         \"http://www.w3.org/TR/REC-html40/strict.dtd\">
     <html>
     <head>
     <meta http-equiv=\"Content-Type\"
     content=\"text/html; charset=iso-8859-1\">
     <meta name=\"GENERATOR\" content=\"$additional\">
     <title>$title</title>
     <link rel=\"stylesheet\" href=\"$css\" type=\"text/css\">
     </head>
     <body>
     $header<pre><tt>"
     "</tt></pre>
     $footer</body>
     </html>
     "
     end
   For an HTML document with header and footer, (file
html.outlang) we have (note the use of $docbgcolor):
     doctemplate
     "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML//EN\">
     <html>
     <head>
     <meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">
     <meta name=\"GENERATOR\" content=\"$additional\">
     <title>$title</title>
     </head>
     <body bgcolor=\"$docbgcolor\">
     $header<pre><tt>"
     "</tt></pre>
     $footer</body>
     </html>
     "
     end
   And for an HTML table output (file htmltable.outlang):
     doctemplate
     "<table  BGCOLOR=\"$docbgcolor\" NOSAVE >
     <tr NOSAVE>
     <td NOSAVE>
     <pre><tt>"
     "</tt></pre>
     </td>
     </tr>
     </table>
     "
     end
   
As a complete example we show the file html_common.outlang which contains the common definitions for the various HTML output formats (html.outlang, htmltable.outlang, etc.):
     include "html_ref.outlang"
     
     extension "html"
     
     bold "<b>$text</b>"
     italics "<i>$text</i>"
     underline "<u>$text</u>"
     color "<font color=\"$style\">$text</font>"
     
     colormap
     "green" "#33CC00"
     "red" "#FF0000"
     "darkred" "#990000"
     "blue" "#0000FF"
     "brown" "#9A1900"
     "pink" "#CC33CC"
     "yellow" "#FFCC00"
     "cyan" "#66FFFF"
     "purple" "#993399"
     "orange" "#FF6600"
     "brightorange" "#FF9900"
     "brightgreen" "#33FF33"
     "darkgreen" "#009900"
     "black" "#000000"
     "teal" "#008080"
     "gray" "#808080"
     "darkblue" "#000080"
     default "#000000"
     end
     
     translations
     "&" "&amp;"
     "<" "&lt;"
     ">" "&gt;"
     end
     
   Moreover, this file is also used for generating javadoc output:
     include "html_common.outlang"
     
     doctemplate
     " * <!-- Generated by Source-highlight -->
      * <pre><tt>
     "
     " * </tt></pre>
     "
     end
     
     nodoctemplate
     " * <!-- Generated by Source-highlight -->
      * <pre><tt>
     "
     " * </tt></pre>
     "
     end
     
     lineprefix " * "
     
     translations
     "*/" "&#42;&#47;" # this avoids the */ to be interpreted as
     # the end of a comment inside a javadoc comment
     end
     
   The javadoc output format is useful to format code snippets that have to
be included inside a javadoc comment of another Java
file.  Apart from being formatted nicely in the
generated HTML documentation, this also relieves the programmer from
escaping specific characters in the code snippet (i.e., &,
< and >).  Note that it also prevents the sequence
*/ from being interpreted as the closing of the (javadoc) comment. 
For instance, if you write this code:
     /**
      * This is an example of usage
      *
      * <pre><tt>
      * System.out.println("*/");
      * </tt></pre>
      */
   The resulting Java code contains a syntax error. If you use source-highlight to format the code to insert in a javadoc comment you will avoid these problems.
An example of a javadoc generated HTML page containing a code snippet formatted with source-highlight can be found in the file SimpleClass-doc.html in the documentation directory.
Since version 2.2 Source-highlight also produces references to fields, variables, etc. In order to do this it relies on the program Exuberant Ctags, by Darren Hiebert, available at http://ctags.sourceforge.net. Thus, you must install this program if you want Source-highlight to provide this feature.
The ctags program generates an index (or “tag”) file for a
variety of language objects found in file(s).  This allows these items
to be quickly and easily located by a text editor or other utility (as
in this case for Source-highlight).  A “tag” signifies a language
object for which an index entry is available (or, alternatively, the
index entry created for that object).
   
This means that Source-highlight is able to generate references for a
specific source language if and only if ctags handles such
language.  We refer to the command line options of ctags:
--list-maps and --list-languages to find out the
associations of file extensions and supported languages.
   
Reference generation is enabled by using the command line option
--gen-references (Invoking source-highlight).  This option
takes an argument that determines how references will be generated:
     
inline
     the reference is generated in the line itself, at the referenced element;
postline
     the references are generated after the line that contains the elements;
postdoc
     the references are generated at the end of the document.

There is an exception: when an element has more than one reference
(because a variable is defined in many sources or because a method is
overloaded) then if inline is specified, the generation switches
to postline for that occurrence.
   
When --gen-references is specified, Source-highlight first
invokes ctags.  The user can customize this call by using the
command line option --ctags (Invoking source-highlight). 
In particular, if one does not want ctags to be invoked by
Source-highlight (e.g., because the tags file has already been
generated) then --ctags must be passed an empty string,
"".  In this case or when the specified ctags command line
generates an alternative output tag file (the default generated file is
tags), one must specify the exact tag file with the command line
option --ctags-file.
   
Once the tag file is generated, Source-highlight relies on the library
readtags provided by the ctags distribution, and included
in the Source-highlight sources.
   
Note that if a program element is formatted according to a style that
has the option noref (see Output format style) then this
element is not considered a tag, and no reference is generated.  This is
the case, for instance, for a comment element: since the comment
style is declared with the option noref, no string formatted with
that style is considered a tag (see Examples).
Here we provide some examples of sources formatted with
Source-highlight using the -f texinfo
command line option.  Please keep in mind that the highlighting
will not be visible in the Info file, but only in the
printed manual and in the HTML output (well, at least line
numbers are visible everywhere :-).
The first example is produced by using the command:
source-highlight -f texinfo -i test.java -o test.java.texinfo -n
and here's the result
     01: /*
     02:   This is a classical Hello program
     03:   to test source-highlight with Java programs.
     04:   
     05:   to have an html translation type
     06:
     07:         source-highlight -s java -f html --input Hello.java --output Hello.html
     08:         source-highlight -s java -f html < Hello.java > Hello.html
     09:
     10:   or type source-highlight --help for the list of options
     11:
     12:   written by
     13:   Lorenzo Bettini
     14:   http://www.lorenzobettini.it
     15:   http://www.gnu.org/software/src-highlite
     16: */
     17:
     18: package hello;
     19:
     20: import java.io.* ;
     21:
     22: /**
     23:  * <p>
     24:  * A simple Hello World class, used to demonstrate some
     25:  * features of Java source highlighting.
     26:  * </p>
     27:  * TODO: nothing, just to show an highlighted TODO or FIXME
     28:  *
     29:  * @author Lorenzo Bettini
     30:  * @version 2.0
     31:  */ /// class
     32: public class Hello {
     33:     int foo = 1998 ;
     34:     int hex_foo = 0xCAFEBABE;
     35:     boolean b = false;
     36:     Integer i = null ;
     37:     char c = '\'', d = 'n', e = '\\' ;
     38:     String xml = "<tag attr=\"value\">ä</tag>", foo2 = "\\" ;
     39:
     40:     /* mymethod */
     41:     public void mymethod(int i) {
     42:         // just a foo method
     43:     }
     44:     /* mymethod */
     45:
     46:     /* main */
     47:     public static void main( String args[] ) {
     48:         // just some greetings ;-)  /*
     49:         System.out.println( "Hello from java2html :-)" ) ;
     50:         System.out.println( "\tby Lorenzo Bettini" ) ;
     51:         System.out.println( "\thttp://www.lorenzobettini.it" ) ;
     52:         if (argc > 0)
     53:             String param = argc[0];
     54:         //System.out.println( "bye bye... :-D" ) ; // see you soon
     55:     }
     56:     /* main */
     57: }
     58: /// class
     59:
     60: // end of file test.java
   
This example shows the use of --gen-references
functionality.  In particular, the following output is generated with
the command:
     source-highlight -f texinfo -i test.h -o test_ref.h.texinfo -n \
          --gen-references=postline
   and here's the result (note how the comment line containing the string
mysum does not contain references, since it is a comment
element, and this element has the option noref in the
texinfo.style, see Output format style.  The same holds for
the _TEXTGEN_H comment in the last comment line).
     01: /**
     02: ** Copyright (C) 1999-2007 Lorenzo Bettini
     03: **  
     04:   http://www.lorenzobettini.it
     05:   
     06:   r2 = r2 XOR (1<<10);
     07:   cout << "hello world" << endl;
     08: **  
     09: */
     10:
     11: // this file also contains the definition of mysum as a #define
     12:
     13: // textgenerator.h : Text Generator class &&
     14:
     15: #ifndef _TEXTGEN_H
     See _TEXTGEN_H.
     
     16: #define _TEXTGEN_H
     17:
     18: #define foo(x) (x + 1)
     19:
     20: #define mysum myfunbody
     21:
     22: #include <iostream.h> // for cerr
     23:
     24: #include "genfun.h" /* for generating functions */
     25:
     26: class TextGenerator {
     27:   public :
     28:     virtual void generate( const char *s ) const { (*sout) << s ; }
     29:     virtual void generate( const char *s, int start, int end ) const
     30:       {
     31:         for ( int i = start ; i <= end ; ++i )
     32:           (*sout) << s[i] ;
     33:         return a<p->b ? a : 3;
     34:       }
     35:     virtual void generateln( const char *s ) const
     36:         {
     37:             generate( s ) ;
     See generate.
     
     See generate.
     
     38:             (*sout) << endl ;
     39:         }
     40:     virtual void generateEntire( const char *s ) const
     41:         {
     42:             startTextGeneration() ;
     See startTextGeneration.
     
     See startTextGeneration.
     
     43:             generate(s) ;
     See generate.
     
     See generate.
     
     44:             endTextGeneration() ;
     See endTextGeneration.
     
     See endTextGeneration.
     
     45:         }
     46:     virtual void startTextGeneration() const {}
     47:     virtual void endTextGeneration() const {}
     48:     virtual void beginText( const char *s ) const
     49:         {
     50:             startTextGeneration() ;
     See startTextGeneration.
     
     See startTextGeneration.
     
     51:             if ( s )
     52:                 generate( s ) ;
     See generate.
     
     See generate.
     
     53:         }
     54:     virtual void endText( const char *s ) const
     55:         {
     56:             if ( s )
     57:                 generate( s ) ;
     See generate.
     
     See generate.
     
     58:             endTextGeneration() ;
     See endTextGeneration.
     
     See endTextGeneration.
     
     59:         }
     60: } ;
     61:
     62: // Decorator
     63: class TextDecorator : public TextGenerator {
     See TextGenerator.
     
     64:   protected :
     65:     TextGenerator *decorated ;
     See TextGenerator.
     
     66:
     67:   public :
     68:     TextDecorator( TextGenerator *t ) : decorated( t ) {}
     See TextGenerator.
     
     See decorated.
     
     69:
     70:     virtual void startTextGeneration() const
     71:     {
     72:         startDecorate() ;
     73:         if ( decorated )
     See decorated.
     
     74:             decorated->startTextGeneration() ;
     See startTextGeneration.
     
     See decorated.
     
     See startTextGeneration.
     
     75:     }
     76:     virtual void endTextGeneration() const
     77:     {
     78:         if ( decorated )
     See decorated.
     
     79:             decorated->endTextGeneration() ;
     See endTextGeneration.
     
     See decorated.
     
     See endTextGeneration.
     
     80:         endDecorate() ;
     81:         mysum;
     See mysum.
     
     82:     }
     83:
     84:     // pure virtual functions
     85:     virtual void startDecorate() const = 0 ;
     86:     virtual void endDecorate() const = 0 ;
     87: } ;
     88:
     89: #endif // _TEXTGEN_H
     90:
   
This is an example that uses the --line-range command line
option on the input file shown in Simple example:
     source-highlight -f texinfo -i test.java -n \
          --line-range="12-18","29-34"
   This generates the following output
     12:   written by
     13:   Lorenzo Bettini
     14:   http://www.lorenzobettini.it
     15:   http://www.gnu.org/software/src-highlite
     16: */
     17:
     18: package hello;
     29:  * @author Lorenzo Bettini
     30:  * @version 2.0
     31:  */ /// class
     32: public class Hello {
     33:     int foo = 1998 ;
     34:     int hex_foo = 0xCAFEBABE;
     
   Note that, although the specified line ranges span comment environments, the highlighting is respected: the starting of the comment is not printed, but the remaining parts of the comment are correctly highlighted as comment.
This is an example that uses the command line option --line-range
together with the --range-context and --range-separator:
     source-highlight -f texinfo -i test.java -n \
          --line-range="12-18","29-34" \
          --range-context=2 \
          --range-separator="{... not in range ...}"
   This generates the following output
     {... not in range ...}
     10:   or type source-highlight --help for the list of options
     11:
     12:   written by
     13:   Lorenzo Bettini
     14:   http://www.lorenzobettini.it
     15:   http://www.gnu.org/software/src-highlite
     16: */
     17:
     18: package hello;
     19:
     20: import java.io.* ;
     {... not in range ...}
     27:  * TODO: nothing, just to show an highlighted TODO or FIXME
     28:  *
     29:  * @author Lorenzo Bettini
     30:  * @version 2.0
     31:  */ /// class
     32: public class Hello {
     33:     int foo = 1998 ;
     34:     int hex_foo = 0xCAFEBABE;
     35:     boolean b = false;
     36:     Integer i = null ;
     {... not in range ...}
     
   Note the 2 additional context lines before and after each range (compare
with the output in Line ranges).  Note that the (elements of the)
context lines are not highlighted.  Moreover, the range separator line
"{... not in range ...}" is printed between ranges, and in this output
also before the first range and after the last one (the
separator string is preformatted automatically, so, e.g., you don't have
to escape special output characters, such as the { } in texinfo
output).
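The interplay between --line-range, --range-context and --range-separator can be sketched in a few lines of Python.  This is only an illustration that reproduces the shape of the output shown above, not the way source-highlight implements it: the function select_lines and its parameters are hypothetical names, and it is an assumption that every run of omitted lines is replaced by exactly one separator line.

```python
def select_lines(lines, ranges, context=0, separator=None):
    """Return (number, text, kind) tuples for the lines that would be shown.

    kind is "range", "context" or "separator"; separator entries have
    number None.  Hypothetical helper, for illustration only.
    """
    shown = []
    skipping = False  # True while we are inside a run of omitted lines
    for number, text in enumerate(lines, start=1):
        if any(lo <= number <= hi for lo, hi in ranges):
            kind = "range"
        elif any(lo - context <= number <= hi + context for lo, hi in ranges):
            kind = "context"
        else:
            kind = None
        if kind is None:
            # Assumption: one separator per run of omitted lines.
            if separator is not None and not skipping:
                shown.append((None, separator, "separator"))
            skipping = True
        else:
            shown.append((number, text, kind))
            skipping = False
    return shown


# A synthetic 40-line input stands in for test.java.
lines = ["line %d" % n for n in range(1, 41)]
shown = select_lines(lines, [(12, 18), (29, 34)], context=2,
                     separator="{... not in range ...}")
for number, text, kind in shown:
    print(text if number is None else "%d: %s" % (number, text))
```

With these arguments the sketch prints a separator, lines 10 to 20, a separator, lines 27 to 36 and a final separator, which is the same layout as the listing above.  In the real program only the lines of kind "range" would be highlighted.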
Ranges can also be expressed using regular expressions, with the command
line option --regex-range.  In this case the beginning of the
range will be detected by a line containing (at any point) a string
matching the specified regular expression; the end will be detected by a
line containing a string matching the same regular expression that
started the range.  This feature is very useful when we want to document
some code (e.g., in this very manual) by showing only specific parts,
that are delimited in an ad hoc way in the source code (e.g., with
specific comment patterns).
   
For instance, the following output was produced, starting from the source file shown in Simple example, by specifying:
--regex-range="/// [[:alpha:]]+"
Note that the lines containing /// class, which determine the
range, are not shown in the output:
     32: public class Hello {
     33:     int foo = 1998 ;
     34:     int hex_foo = 0xCAFEBABE;
     35:     boolean b = false;
     36:     Integer i = null ;
     37:     char c = '\'', d = 'n', e = '\\' ;
     38:     String xml = "<tag attr=\"value\">ä</tag>", foo2 = "\\" ;
     39:
     40:     /* mymethod */
     41:     public void mymethod(int i) {
     42:         // just a foo method
     43:     }
     44:     /* mymethod */
     45:
     46:     /* main */
     47:     public static void main( String args[] ) {
     48:         // just some greetings ;-)  /*
     49:         System.out.println( "Hello from java2html :-)" ) ;
     50:         System.out.println( "\tby Lorenzo Bettini" ) ;
     51:         System.out.println( "\thttp://www.lorenzobettini.it" ) ;
     52:         if (argc > 0)
     53:             String param = argc[0];
     54:         //System.out.println( "bye bye... :-D" ) ; // see you soon
     55:     }
     56:     /* main */
     57: }
     
   Furthermore, the line numbers are consistent with the lines of the original file.
If we want to output only what is included between /* main */, we
specify (note that we must escape the special regular expression
character *):
--regex-range="/\* main \*/"
and we get:
     47:     public static void main( String args[] ) {
     48:         // just some greetings ;-)  /*
     49:         System.out.println( "Hello from java2html :-)" ) ;
     50:         System.out.println( "\tby Lorenzo Bettini" ) ;
     51:         System.out.println( "\thttp://www.lorenzobettini.it" ) ;
     52:         if (argc > 0)
     53:             String param = argc[0];
     54:         //System.out.println( "bye bye... :-D" ) ; // see you soon
     55:     }
     
   If we want to show only the methods, which in the source file are delimited by comment lines containing the method's name, we can specify:
--regex-range="/\* [[:alpha:]]+ \*/"
     41:     public void mymethod(int i) {
     42:         // just a foo method
     43:     }
     47:     public static void main( String args[] ) {
     48:         // just some greetings ;-)  /*
     49:         System.out.println( "Hello from java2html :-)" ) ;
     50:         System.out.println( "\tby Lorenzo Bettini" ) ;
     51:         System.out.println( "\thttp://www.lorenzobettini.it" ) ;
     52:         if (argc > 0)
     53:             String param = argc[0];
     54:         //System.out.println( "bye bye... :-D" ) ; // see you soon
     55:     }
     
   In this case, we could also have specified:
--regex-range="/\* main \*/","/\* mymethod \*/"
since --regex-range accepts multiple regular expressions.
   
IMPORTANT: the order in which the regular expressions are specified is crucial, since they are tested in the same order in which they are given on the command line.
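The rules described in this section (a range is opened by the first expression that matches a line, it is closed only by that same expression, and the delimiter lines themselves are not printed) can be sketched in Python as follows.  This is a minimal sketch of the behaviour as described here, not the actual implementation; regex_ranges is a hypothetical name.  Python's re module has no POSIX classes such as [[:alpha:]], so the sketch uses [a-z]+ instead.

```python
import re


def regex_ranges(lines, patterns):
    """Yield (line_number, text) for the lines enclosed by delimiter lines.

    Hypothetical helper, for illustration only.
    """
    compiled = [re.compile(p) for p in patterns]
    closing = None  # the expression that opened the current range, if any
    for number, text in enumerate(lines, start=1):
        if closing is None:
            # Outside a range: test the expressions in the order given.
            for rx in compiled:
                if rx.search(text):
                    closing = rx  # opening delimiter line, not printed
                    break
        elif closing.search(text):
            closing = None  # closing delimiter line, not printed
        else:
            yield number, text


source = [
    "class Hello {",
    "    /* mymethod */",
    "    void mymethod() {}",
    "    /* mymethod */",
    "    /* main */",
    "    void main() {}",
    "    /* main */",
    "}",
]
selected = list(regex_ranges(source, [r"/\* main \*/", r"/\* mymethod \*/"]))
# selected == [(3, "    void mymethod() {}"), (6, "    void main() {}")]

# The order matters when one line matches more than one expression.
sample = ["/* main */", "x", "/* other */", "y", "/* main */"]
specific_first = [n for n, _ in regex_ranges(
    sample, [r"/\* main \*/", r"/\* [a-z]+ \*/"])]   # [2, 3, 4]
generic_first = [n for n, _ in regex_ranges(
    sample, [r"/\* [a-z]+ \*/", r"/\* main \*/"])]   # [2]
```

In the second example the first line matches both expressions.  When the specific expression comes first, it opens the range and only another /* main */ line can close it; when the generic expression comes first, the /* other */ line already closes the range.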
If you find a bug in source-highlight, please send electronic mail to
bug-source-highlight at gnu dot org
   
Include the version number, which you can find by running ‘source-highlight --version’. Also include in your message the output that the program produced and the output you expected.
If you have other questions, comments or suggestions about source-highlight, contact the author via electronic mail (find the address at http://www.lorenzobettini.it). The author will try to help you out, although he may not have time to fix your problems.
The following mailing lists are available:
help-source-highlight at gnu dot org
   
for generic discussions about the program and for asking for help about it (open mailing list), http://mail.gnu.org/mailman/listinfo/help-source-highlight
info-source-highlight at gnu dot org
   
for receiving information about new releases and features (read-only mailing list), http://mail.gnu.org/mailman/listinfo/info-source-highlight.
If you want to subscribe to a mailing list just go to the URL and follow the instructions, or send me an e-mail and I'll subscribe you.
I'll describe new features in new releases also in my blog, at this URL:
http://tronprog.blogspot.com/search/label/source-highlight
[1] Up to version 2.9, there were also the suffixes
-doc and -css-doc, but this mechanism was quite confusing
and complex; hopefully, this new one should be better.
[2] Although this might have been achieved with previous versions, it is an officially supported feature since version 2.5.
[3] http://www.gnu.org/software/autoconf
[4] http://www.gnu.org/software/automake
[5] http://www.gnu.org/software/libtool
[6] http://www.gnu.org/software/gnulib
[7] Since version 2.11, the configure script should be able
to correctly find the boost regex library if it is in the compiler
default path.
[8] Command
lines that are too long are split into multiple indented lines separated
by a \.  Of course these commands are to be given in one line
only, anyway.
[9] Command lines that are too long are
split into multiple indented lines separated by a \.  Of course
these commands are to be given in one line only, anyway.
[10] Before version 2.1, this file was called tags.j2h which used to be a very obscure name. I hope this name convention is a better one :-).
[11] Since version 2.6.
[12] Before version 2.1, this command line
option was called --tags-file which used to be a very obscure
name.  I hope this name convention is a better one :-).
[13] Since version 2.6.
[14] Of course, if you use HTML and an external CSS file you will achieve the same result.
[15] You can see these colors in HTML in the file colors.html.
[16] Note that, since version 2.2, you must use double quotes.
[17] Since version 2.6.
[18] Since version 2.9.
[19] This is the main difference introduced in version 2.0 with respect to the previous version.
[20] This is the main difference introduced in version 2.1 with respect to the previous version.
[21] As explained before, Source-highlight was originally intended mainly for generating HTML output; this is why the term css is used for style sheets.
[22] Padding character can be specified since version 2.8.
[23] Since version 2.7.
[24] Since version 2.7.
[25] This issue concerning Perl regular expression syntax was raised by Elias Pipping, and this also pushed me to deal with this more powerful syntax that permits using backreferences, for instance. Although we're still far from highlighting Perl syntax completely (Perl), I definitely must thank Elias for his precious information about this matter :-)
[26] As Ed Kelly correctly pointed out, C-style comments are NOT nested; it's a big shame I've been using C++ and Java for years and have always thought they were nested :-)... Thus, in previous versions of source-highlight distributions, C-style comments were (incorrectly) defined as nested. Thank you Ed, for your feedback!
[27] Since version 2.8.
[28] I'm grateful to Jurgen Hotzel for raising this issue about Lua comments; this led me to introduce dynamic backreferences.
[29] At least, to the best of my knowledge :-)
[30] The strategy used by source-highlight for matching regular expressions changed since version 2.11 (and in version 2.10 the strategy used was not completely conceptually correct and it had a lot of overhead).
[31] according to the terminology of regular expressions.
[32] http://www.boost.org/libs/regex/doc/syntax.html
[33] the index only, without the escape character.
[34] This expression was provided by John Maddock, the author of the Boost regex library, as a solution of a problem I posted on the boost list,
http://thread.gmane.org/gmane.comp.lib.boost.devel/158237/focus=158276
[35] Since version 2.4.
[36] Up
to version 2.9 the output of --show-regex was a little bit more
complex to read; hopefully this output is better.
[37] Please note that this concept of state is different from the concept of “state” of an automaton.
[38] As a future extension we might think of providing a way, in the language definition syntax, to define a state/environment that extends the outer contexts instead of overriding them.
[39] This was not tested extensively and might not catch all the correct situations.
[40] OK, there are no templates in C, and they are only in C++, but we think it should do no harm when highlighting C files.
[41] Before version 2.9, there was only cpp.lang which was used both for C and C++; however, this way, if you had a C program where you were using a C++ keyword as a variable name—which of course is correct in C—that variable was actually highlighted as a keyword and this was not correct.
[42] Since version 2.6.
[43] This is a sort of trick to insert spaces at
the beginning of a line without using a tabular environment; without the
leading \mbox{} these spaces would be ignored.  This is the
only way I found to achieve this, if you have suggestions, please let me
know!
[44] Since version 2.4.
[45] Unless they are inside a
<tt>...</tt>.
[46] Up
to version 2.9, there was only doctemplate and for --doc
there was a separate .outlang file; I think the present solution
is better and reduces the number of files.
[47] Since version 2.6.
[48] Although I haven't tested it, I think this will work also for Doxygen comments.
[49] This description is taken from the ctags man page