Using Lingua::Lid in a Threaded Application
As of version 0.02, Lingua::Lid is thread-safe if compiled against a recent version of lid (3.0.0 or higher).
This allows you to safely call Lingua::Lid's language and charset identification functions, such as lid_ffile and lid_fstr, concurrently within your application by using Perl's threads module. Because thread support in Perl is a compile-time option, you will need a thread-enabled build of Perl, as shipped by most modern Linux distributions such as Debian Lenny or Ubuntu Lucid, or by ActiveState for Windows.
Of course, you are free to use Lingua::Lid in any non-threaded code as well. Software written using Lingua::Lid v0.01 will stay functional without modification.
If you are not familiar with Perl's threads, the perlthrtut tutorial and the threads module documentation are good places to start.
The following example application, lingua-lid-thread-example.pl, provides
a basic example of how Lingua::Lid may be used in a threaded application.
A set of files is given to the application as arguments.
The application then creates up to $max_threads threads at a time, each
identifying the language and character encoding of one file, and prints
the results until there are no files left to identify.
    #!/usr/bin/perl -w

    use strict;
    use Config;

    die "usage: $0 file(s)\n" unless scalar @ARGV;

    ## check whether the used version of Perl has been compiled
    ## with thread support
    unless ($Config{useithreads}) {
        die "The used version of Perl does not support threads!\n";
    }

    require threads;
    require Lingua::Lid;

    my $max_threads = 2;
    my $nr = 0;

    ## while there are files given as arguments left...
    while (@ARGV) {
        my $file = shift(@ARGV);

        ## ...create a thread that identifies the file's language
        ## and charset and returns the determined results in
        ## scalar context when it is requested to join() to the
        ## main thread of control again.
        threads->create({ context => "scalar" }, sub {
            ## identify language and charset of the file using
            ## lid's lid_ffile
            my $res = Lingua::Lid::lid_ffile($file);

            return {
                file => $file,
                ## $res will be undef if no result could be
                ## computed
                result => $res,
                ## in this case, Lingua::Lid::errstr() will
                ## return the error message reported by lid's
                ## lid_strerror() function.
                errstr => Lingua::Lid::errstr()
            };
        });

        ## if the maximum amount of concurrent threads has been
        ## reached or no files are left to identify, join all
        ## threads and print their results.
        if (scalar @ARGV % $max_threads == 0 || ! scalar @ARGV) {
            foreach my $thread (threads->list()) {
                my $rv = $thread->join();

                printf("%02d: %s: %s\n", ++$nr, $rv->{file},
                    $rv->{result} ?
                        join(", ",
                            $rv->{result}->{language},
                            $rv->{result}->{isocode},
                            $rv->{result}->{encoding})
                        : "ERROR: $rv->{errstr}"
                );
            }
        }
    }
Download the source code: Lingua-Lid-thread-example.pl
Please note that the package variable $Lingua::Lid::errstr could have been
used instead of Lingua::Lid::errstr().
Internally it is tied to Lingua::Lid::errstr() using Tie::Scalar and is
therefore thread-safe as well. However, new code should use the function
to obtain an error message: the package variable may be removed in a
future release of Lingua::Lid, because a global variable by its very
concept suggests a lack of thread-safety.
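The function form of error retrieval can be sketched for a single string as follows. This is only a minimal sketch: it assumes that lid_fstr takes the text to identify as its only argument and returns the same kind of result as lid_ffile in the example above. The helper name describe_text is hypothetical and not part of Lingua::Lid.

```perl
#!/usr/bin/perl
use strict;
use warnings;

## Hypothetical helper: returns "language, isocode, encoding" for
## the given text, or an error line if no result could be computed.
sub describe_text {
    my ($text) = @_;

    ## load the module at run time, as the example above does
    require Lingua::Lid;

    ## assumption: lid_fstr() takes the text as its only argument
    ## and returns undef on failure, like lid_ffile()
    my $res = Lingua::Lid::lid_fstr($text);

    ## use the function, not the package variable, to get the error
    return "ERROR: " . Lingua::Lid::errstr() unless $res;

    return join(", ", @{$res}{qw(language isocode encoding)});
}

## describe every string given on the command line
foreach my $text (@ARGV) {
    print describe_text($text), "\n";
}
```

Since the error message is fetched with a function call directly after the failed identification, the code does not rely on the package variable and should keep working if that variable is removed.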
Here is an example invocation using a set of text files in a variety of languages and charsets, interspersed with some nonexistent or "special" files to demonstrate Lingua::Lid's error handling facilities.
    $ perl lingua-lid-thread-example.pl danish.txt \
        dutch.txt non-existent.txt english.txt /dev/null \
        french.txt german.txt /dev/zero swedish.txt
    01: danish.txt: Danish, dan, UTF-8
    02: dutch.txt: Dutch, nld, UTF-8
    03: non-existent.txt: ERROR: Failed to open file
    04: english.txt: English, eng, UTF-8
    05: /dev/null: ERROR: Insufficient input length
    06: french.txt: French, fra, UTF-8
    07: german.txt: German, deu, UTF-8
    08: /dev/zero: ERROR: Binary input data
    09: swedish.txt: Swedish, swe, UTF-8
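
The example application starts one thread per file and joins them in batches. A different technique, not used by the example itself, is a fixed pool of worker threads that fetch file names from a shared Thread::Queue. The following is a rough sketch of that variation under stated assumptions: identify() is a hypothetical stand-in for Lingua::Lid::lid_ffile() so that the sketch runs without lid installed, and identify_all() is likewise a made-up helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;        ## fails at compile time without thread support
use Thread::Queue;

## Hypothetical stand-in for Lingua::Lid::lid_ffile(): returns a
## placeholder result for readable files and undef otherwise.
sub identify {
    my ($file) = @_;
    return -r $file
        ? { language => "n/a", isocode => "n/a", encoding => "n/a" }
        : undef;
}

## Process @files with exactly $max_threads worker threads and
## return one record (file, result) per file.
sub identify_all {
    my ($max_threads, @files) = @_;

    ## fill the shared queue with all file names up front
    my $queue = Thread::Queue->new(@files);

    my @workers;
    for (1 .. $max_threads) {
        push @workers, threads->create({ context => "list" }, sub {
            my @done;
            ## fetch file names until the queue is empty
            while (defined(my $file = $queue->dequeue_nb())) {
                push @done, { file => $file, result => identify($file) };
            }
            return @done;
        });
    }

    ## join() hands back each worker's list of records
    return map { $_->join() } @workers;
}

my $nr = 0;
foreach my $rv (identify_all(2, @ARGV)) {
    printf("%02d: %s: %s\n", ++$nr, $rv->{file},
        $rv->{result} ? "identified" : "ERROR: no result");
}
```

With this approach the number of threads stays constant regardless of how many files are given. Results are grouped by worker, so they are not necessarily printed in the order of the arguments.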

2010-06-21 09:12