Example of Use

lidc detects the language and character encoding of textual input and supports a variety of common input formats. This document gives some examples of how to use lidc effectively.

Basic Example of Use

Suppose you have written an HTML document and are unsure which character encoding your editor used when it saved the file to disk. With lidc, obtaining this information is trivial:

Shell$ lidc -i document.html
German, deu, UTF-8

lidc automatically guesses the document's type to be HTML by evaluating its file extension. The results tell you that the document is encoded using the "UTF-8" character set and is written in German (ISO 639-3 code: "deu").

lidc can also be used within a pipe. In that case no filename is known to lidc, so it cannot determine the document's type automatically and the type must be set manually ("-t"):

Shell$ cat document.html | lidc -t html
German, deu, UTF-8

Customizing lidc's Output

By default, lidc displays the following information, separated by commas:

  • the detected language's name
  • the detected language's ISO 639-3 code
  • the identified character encoding's name
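
For scripting, the default line can be split back into its three fields. The following is a minimal sketch in POSIX shell, assuming the fields are always separated by a comma followed by a space, as in the examples above, and that no language name contains that sequence itself. The function name parse_lidc_line and the variable names are hypothetical, and the sample line is copied from the example output rather than produced by a live lidc call.

```shell
# Hypothetical helper: split one line of lidc's default output
# ("name, iso-code, encoding") into three shell variables.
# Assumes ", " occurs only as the field separator.
parse_lidc_line() {
    lang=${1%%, *}      # everything before the first ", "
    rest=${1#*, }       # everything after the first ", "
    code=${rest%%, *}   # ISO 639-3 code
    enc=${rest#*, }     # character encoding name
}

# Sample line taken from the example output above.
parse_lidc_line 'German, deu, UTF-8'
printf 'language=%s code=%s encoding=%s\n' "$lang" "$code" "$enc"
```

In a real script, the literal string would be replaced by the output of a lidc call such as "$(lidc -i document.html)".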

The output can be customized to fit your needs by embedding placeholders in a free-form format string.

In the next example, lidc is instructed to output only the file's name ("%f") along with the declared ("%d") and identified character encoding ("%e"):

Shell$ lidc -i document.html -f "%f: %d - %e\n"
document.html: ISO-8859-1 - UTF-8

In the example above, the declared character encoding (ISO-8859-1) does not match the one actually used (UTF-8). lidc can therefore be used to spot this kind of common character encoding error.
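
The same comparison can be automated for many files. The sketch below assumes lidc is run with the tab-separated format string "%f\t%d\t%e\n", which uses only placeholders and escapes shown in this document. The helper name report_mismatches is hypothetical, comparing encoding names case-insensitively is a choice made here rather than documented lidc behaviour, and the sample input is made up so that the block runs without lidc. What lidc prints for "%d" when a document declares no encoding is not covered here.

```shell
# Hypothetical helper: read "file<TAB>declared<TAB>detected" lines on stdin
# and print only those where the two encoding names differ.
# Encoding names are compared case-insensitively (an assumption of this sketch).
report_mismatches() {
    awk -F '\t' 'toupper($2) != toupper($3) {
        printf "%s: declared %s, detected %s\n", $1, $2, $3
    }'
}

# Made-up sample input standing in for real lidc output.
printf 'document.html\tISO-8859-1\tUTF-8\nindex.html\tUTF-8\tUTF-8\n' |
    report_mismatches
```

With lidc installed, the sample input would be replaced by a loop such as: for f in *.html; do lidc -i "$f" -f "%f\t%d\t%e\n"; done | report_mismatches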

As the format string can be freely customized, lidc can produce more elaborate output formats such as XML documents or SQL statements as needed. The following examples produce an XML fragment and an SQL statement.

Shell$ lidc -i document.html -f "<file>\n\t<lang>%l</lang>\
\n\t<charset>%e</charset>\n</file>\n\n"
<file>
    <lang>German</lang>
    <charset>UTF-8</charset>
</file>

Shell$ lidc -i document.html \
-f "INSERT INTO documents VALUES('%f','%l','%e');\n"
INSERT INTO documents VALUES('document.html','German','UTF-8');
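
A statement built this way becomes invalid if a file name contains a single quote. The following sketch shows one way to guard against that in a wrapper script. It assumes a database that follows the standard SQL rule of doubling single quotes inside string literals, and it assumes lidc is asked for tab-separated fields with "%f\t%l\t%e\n". The helper names sql_quote and to_insert are hypothetical, and the sample input line is made up.

```shell
# Hypothetical helper: wrap a value in single quotes, doubling any single
# quote it contains (standard SQL string-literal escaping).
sql_quote() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/''/g")"
}

# Hypothetical helper: turn "file<TAB>language<TAB>encoding" lines on stdin
# into INSERT statements for the documents table used above.
to_insert() {
    while IFS="$(printf '\t')" read -r file lang enc; do
        printf 'INSERT INTO documents VALUES(%s,%s,%s);\n' \
            "$(sql_quote "$file")" "$(sql_quote "$lang")" "$(sql_quote "$enc")"
    done
}

# Made-up sample line standing in for real lidc output.
printf "o'brien.html\tGerman\tUTF-8\n" | to_insert
```

With lidc installed, the sample line would be replaced by: lidc -i "$file" -f "%f\t%l\t%e\n" | to_insert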

See the list of placeholders for the complete set supported by lidc.

Using External Filters

lidc supports a large set of input formats. If you need to use lidc with a format that is not yet supported, you can run a third-party filter on your document and pass its output to lidc through a pipe.

The following example shows how to detect the language of a PDF document, using pdftotext(1) (part of xpdf) and lidc.

Shell$ pdftotext manual__eng.pdf - | lidc -f "%l\n"
English
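
The pipeline above can be wrapped in a small function to label several PDF files at once. This is a minimal sketch: the function name pdf_languages is hypothetical, and it relies only on the pdftotext and lidc invocations shown above.

```shell
# Hypothetical wrapper around the pdftotext | lidc pipeline shown above.
# Prints "<file>: <language>" for each PDF given as an argument.
pdf_languages() {
    # Fail early if either tool cannot be found.
    for tool in pdftotext lidc; do
        command -v "$tool" >/dev/null 2>&1 || {
            printf 'missing required tool: %s\n' "$tool" >&2
            return 1
        }
    done
    for pdf in "$@"; do
        printf '%s: %s\n' "$pdf" "$(pdftotext "$pdf" - | lidc -f "%l\n")"
    done
}
```

It would be called as, for example: pdf_languages *.pdf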


See lidc's man page or manual for an in-depth description of all its options.