Abstract
Vilistextum is an HTML to ASCII converter specifically programmed to output ASCII text that is suitable for reading.
Some features:
- stdin/stdout capability
- can swallow multiple empty lines
- set width of output text
- removes empty ALT tags
- set default string for IMG without an ALT tag
- can convert characters and entities between 128 and 159 from the windows1252 charset to meaningful strings in ISO 8859-1. E.g. 0x93 is converted to '"'. Quite a lot of broken documents on the web use these windows1252 characters.
- output can be optimized for ebook reading
- GUI-frontend using kaptain
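The windows1252 conversion from the feature list can be pictured with a small Python sketch. This is not code from vilistextum: the function name and the table are hypothetical, and only the 0x93 to '"' mapping is taken from this document. The other replacement strings are assumptions chosen for illustration.

```python
# Hypothetical sketch, not taken from the vilistextum sources.
# Only the 0x93 -> '"' mapping is stated in this document; the other
# replacement strings are illustrative assumptions.
REPLACEMENTS = {
    0x85: "...",   # horizontal ellipsis
    0x91: "'",     # left single quotation mark
    0x92: "'",     # right single quotation mark
    0x93: '"',     # left double quotation mark
    0x94: '"',     # right double quotation mark
    0x96: "-",     # en dash
    0x97: "--",    # em dash
    0x99: "(TM)",  # trade mark sign
}


def convert_windows1252(data: bytes) -> str:
    """Decode bytes as ISO 8859-1, replacing known bytes in 128..159."""
    out = []
    for byte in data:
        if 0x80 <= byte <= 0x9F and byte in REPLACEMENTS:
            out.append(REPLACEMENTS[byte])
        else:
            # Every other byte maps to the code point with the same number,
            # which is how ISO 8859-1 is defined.
            out.append(chr(byte))
    return "".join(out)


print(convert_windows1252(b"\x93quoted\x94 text"))  # prints "quoted" text
```

Bytes in the range that have no entry in the table are passed through unchanged in this sketch. The real program may treat them differently.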
REQUIREMENT:
For the main program a decent gcc installation suffices.
If you want to use the GUI-frontend, kaptain must be installed as well.
INSTALL:
make or gmake
DOWNLOAD:
vilistextum-v2.3.2.tar.gz
vilistextum-v2.3.2.tar.bz2
USAGE:
vilistextum [OPTIONS] [inputfile|-] [outputfile|-]
This is the command line program.
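Because '-' can stand for both inputfile and outputfile, the program can be used as a filter from a script. The following Python sketch shows one way to do that. The helper names build_command and html_to_text are hypothetical, and the sketch assumes that the vilistextum binary is installed and on the PATH. Only options documented in this manual (-w, -l, -s) are used.

```python
import subprocess


def build_command(width=72, links=False, shrink=False):
    """Build an argv list that reads stdin and writes stdout."""
    argv = ["vilistextum", "-w", str(width)]
    if links:
        argv.append("-l")
    if shrink:
        argv.append("-s")
    # '-' for inputfile and outputfile selects stdin and stdout.
    argv += ["-", "-"]
    return argv


def html_to_text(html: bytes, **options) -> bytes:
    """Pipe HTML through vilistextum; assumes the binary is on PATH."""
    result = subprocess.run(
        build_command(**options), input=html, stdout=subprocess.PIPE, check=True
    )
    return result.stdout


print(" ".join(build_command(width=60, links=True)))  # vilistextum -w 60 -l - -
```

Keeping the argument list in its own function makes it possible to check the command without running the program.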
kilistextum
GUI-frontend using kaptain. Its usage should be obvious, even if you haven't read this manual.
Start with "kilistextum". The makefile tries to guess where kaptain resides. If it fails, you can add something like "#!/pathto/kaptain" to the first line or start it with "kaptain kilistextum".
Command line arguments
- inputfile|-, outputfile|-
- Replace inputfile with '-' for reading from standard input, likewise outputfile with '-' for writing to standard output.
- --version
- Reports version number and release date.
- -h,--help
- Prints a list of the command line options.
- -c, --convert-tags
- Some of the tags will be converted to special characters.
E.g.: "<B>Bold</B> isn't <I>italic</I> isn't <U>underlined</U> isn't <EM>emphasized</EM> but is like <STRONG>strong</STRONG>."
will be output as "*Bold* isn't /italic/ isn't _underlined_ isn't /emphasized/ but is like *strong*."
- -p, --palm
- This outputs text more suitable for reading on a PDA.
Palm text readers do their own word wrapping, so the width is set to infinity and the program doesn't right-justify or center the text.
- -w, --width number
- The width of the output text.
Default: 72.
- -m, --nomicrosoft
- The windows1252 entities in the range &#128; to &#159; (€ to Ÿ) and their proper names will not be converted.
- -i, --defimage string
- IMG tags without alt attribute are output as [string].
Default: Image.
- -r, --remove-empty-alt
- If there is an empty ALT attribute in an IMG tag (e.g. <IMG src="..." alt="">), don't output '[]'.
- -s, --shrinklines
- If there are more than two consecutive newlines, output only two, so that there is at most one completely empty line in a row.
- -l, --links
- Numbers the links in the document and prints the corresponding addresses at the end of the file. Similar to 'lynx -dump'. Note: Relative URIs are not resolved and won't be printed.
- -e, --errorlevel number
- Increase level of verbosity for error messages.
0: No error messages
1: Show unrecognized entities
2: Show unknown tags
>2: Mostly debugging information
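To make the effect of -c, -s and -l more concrete, here is a rough Python sketch of the three transformations. It is not how vilistextum is implemented: the function names, the regular expressions and the exact layout of the link list are assumptions. Only the tag markers, the two-newline limit and the rule about relative URIs come from the descriptions above.

```python
import re

# Hypothetical sketch of three options described above; the regexes and
# function names are assumptions, not vilistextum's implementation.
TAG_MARKERS = {"b": "*", "strong": "*", "i": "/", "em": "/", "u": "_"}


def convert_tags(html: str) -> str:
    """-c: replace opening and closing B, I, U, EM, STRONG tags by markers."""
    def marker(match):
        return TAG_MARKERS[match.group(1).lower()]
    return re.sub(r"</?(b|strong|i|em|u)>", marker, html, flags=re.IGNORECASE)


def shrink_lines(text: str) -> str:
    """-s: collapse three or more newlines into two (one empty line)."""
    return re.sub(r"\n{3,}", "\n\n", text)


def number_links(html: str) -> str:
    """-l: number absolute links and list their addresses at the end."""
    urls = []

    def replace(match):
        href, label = match.group(1), match.group(2)
        # Relative URIs are neither resolved nor printed, as noted above.
        if not re.match(r"[a-z][a-z0-9+.-]*:", href, re.IGNORECASE):
            return label
        urls.append(href)
        return f"{label}[{len(urls)}]"

    body = re.sub(r'<a href="([^"]*)">(.*?)</a>', replace, html,
                  flags=re.IGNORECASE | re.DOTALL)
    # The layout of the reference list is an assumption of this sketch.
    refs = "".join(f"\n[{n}] {url}" for n, url in enumerate(urls, 1))
    return body + ("\n" + refs if refs else "")


print(convert_tags("<B>Bold</B> isn't <I>italic</I>"))  # *Bold* isn't /italic/
print(repr(shrink_lines("a\n\n\n\nb")))                 # 'a\n\nb'
print(number_links('<a href="http://example.org/">this</a> link'))
```

The regular expressions only handle tags without attributes and links written exactly as <a href="...">. A real converter needs a proper HTML parser.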
BUGS and similar features:
The handling of OL is broken: the program treats it as UL, and lists nested more than 6 levels deep confuse it.
Text is never justified.
Bug reports or comments:
You can send your comments or bug reports to this address. If you've discovered a bug, please give the link to, or attach a copy of, the HTML file that caused that particular bug.
Patric Müller
Last modified: Tue May 22 00:17:22 CEST 2001