Man page of tw_classify(3) and tw_classify_file(3)

NAME

tw_classify, tw_classify_file - classify a document

SYNOPSIS

 #include <tw.h>

 tw_errno_t tw_classify(tw_t *tw, const char *str, short n,
                        tw_prob_t ***probs);

 tw_errno_t tw_classify_file(tw_t *tw, const char *path, short n,
                        tw_prob_t ***probs);

DESCRIPTION

tw_classify() and tw_classify_file() classify documents into previously learned categories.

tw_classify() processes strings while tw_classify_file() handles documents stored within the file system.

Both functions store a user-definable number of categories along with their probabilities (category/probability pairs) to probs.

PARAMETERS

tw_classify() and tw_classify_file() share all parameters except the second:

tw (tw_t *)

Pointer to an initialized Textweiser object.

n (short)

Number of category/probability pairs that should be stored within probs.

probs (tw_prob_t ***)

Pointer to pointer to pointer to category/probability pairs (tw_prob_t).

Declare probs as a pointer to pointer to tw_prob_t and initialize it to NULL. Pass a pointer to probs to tw_classify() or tw_classify_file().

On success, probs points to a NULL-terminated array of at most n pairs, sorted in descending order of probability.

APPLICATION CODE EXAMPLE:

 tw_prob_t **probs = NULL;

 tw_classify(&tw, string, 5, &probs);

For a complete example, see EXAMPLE below.
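The triple pointer follows the usual C out-parameter idiom: the callee allocates the array and writes its address through the pointer it was given. The following is only a sketch of that idiom and does not show how Textweiser works internally. demo_prob_t, demo_fill(), demo_count() and demo_release() are hypothetical stand-ins that are not part of tw.h, and the stored values are invented.

```c
#include <stdlib.h>

/* Hypothetical stand-in for tw_prob_t with the two documented members. */
typedef struct {
    char  *category;
    float  probability;
} demo_prob_t;

/* Hypothetical producer: stores n pairs in *out as a NULL-terminated
 * array. Returns 0 on success, -1 on bad arguments or out of memory. */
int demo_fill(demo_prob_t ***out, short n)
{
    static char   name[] = "demo";
    demo_prob_t **arr;
    short         i;

    if (out == NULL || n < 0)
        return -1;

    /* n element pointers plus one slot for the terminating NULL */
    arr = malloc(((size_t)n + 1) * sizeof *arr);
    if (arr == NULL)
        return -1;

    for (i = 0; i < n; i++)
    {
        arr[i] = malloc(sizeof **arr);
        if (arr[i] == NULL)
        {
            /* undo the partial allocation before reporting failure */
            while (i > 0)
                free(arr[--i]);
            free(arr);
            return -1;
        }
        arr[i]->category    = name;
        arr[i]->probability = 100.0f / (float)(i + 1); /* invented values */
    }
    arr[n] = NULL;

    *out = arr; /* the caller's pointer now refers to the new array */
    return 0;
}

/* Counts the pairs by walking the array up to the terminating NULL. */
size_t demo_count(demo_prob_t **arr)
{
    size_t count = 0;

    if (arr == NULL)
        return 0;
    while (arr[count] != NULL)
        count++;
    return count;
}

/* Releases each pair and then the array itself. */
void demo_release(demo_prob_t **arr)
{
    size_t i;

    if (arr == NULL)
        return;
    for (i = 0; arr[i] != NULL; i++)
        free(arr[i]);
    free(arr);
}
```

In real application code, the array returned by tw_classify() is released with tw_free_prob_t(3), not with a hand-written routine such as demo_release().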

tw_classify() expects as its second parameter:

str (const char *)

The document's content as a string.

tw_classify_file() expects as its second parameter:

path (const char *)

The path to the document within the file system.

RETURN VALUE

Both tw_classify() and tw_classify_file() return an error indicator (tw_errno_t). A return value of TW_OK indicates success; any other value identifies the error that occurred.

The function tw_strerror(3) can be used to obtain a human-readable error message.

NOTES

o  Both functions require plain-text input, which should be in a supported language; see the Textweiser User Manual for details.

o  Both functions should be passed valid UTF-8 input for maximum classification speed and accuracy.

o  On Windows, the category/probability pairs must be freed with tw_free_prob_t(3). On all other operating systems, using tw_free_prob_t(3) is recommended.

o  The classification results stored in probs are sorted in descending order of probability, so the category with the highest probability is the first array element.
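Because valid UTF-8 is recommended, an application may want to check its input before classifying it. The following is a minimal sketch of such a check. is_valid_utf8() is a hypothetical helper that is not part of Textweiser, and how the library itself treats malformed input is not documented here.

```c
#include <stddef.h>

/* Returns 1 if the first len bytes of s are well-formed UTF-8 and
 * 0 otherwise. Rejects truncated sequences, overlong encodings,
 * UTF-16 surrogates and code points above U+10FFFF. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len)
    {
        unsigned char c = s[i];
        unsigned long cp;   /* decoded code point                 */
        unsigned long min;  /* smallest value legal at this length */
        size_t        need; /* continuation bytes expected         */
        size_t        k;

        if (c < 0x80)
        {
            i++;            /* plain ASCII byte */
            continue;
        }
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80;    }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800;   }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else
            return 0;       /* stray continuation or invalid lead byte */

        if (len - i <= need)
            return 0;       /* sequence is cut off at the end of input */

        for (k = 1; k <= need; k++)
        {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;   /* not a continuation byte */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < min || cp > 0x10FFFFUL || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;       /* overlong, out of range or surrogate */

        i += need + 1;
    }

    return 1;
}
```

A caller could run this check on a buffer before passing it to tw_classify() and convert or reject the input if it fails.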

CATEGORY/PROBABILITY DATA STRUCTURE

The tw_prob_t data structure contains the following members:

category (char *)

The name of the category.

probability (float)

The calculated probability that the input document belongs to category.
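Since the result array is sorted in descending order of probability, picking the best match only requires looking at the first element. The following sketch illustrates this with a hypothetical stand-in type, sketch_prob_t, that mirrors the two members documented above. The helper sketch_best_category() and its threshold parameter are assumptions made for illustration and are not part of tw.h. The threshold is compared on whatever scale the library reports; the output in EXAMPLE suggests percentages.

```c
#include <stddef.h>

/* Hypothetical stand-in mirroring the documented tw_prob_t members. */
typedef struct {
    char  *category;
    float  probability;
} sketch_prob_t;

/* Returns the name of the most probable category, or NULL if the
 * array is empty or the best probability is below min_probability.
 * Relies on the array being sorted in descending order. */
const char *sketch_best_category(sketch_prob_t **probs,
                                 float min_probability)
{
    if (probs == NULL || probs[0] == NULL)
        return NULL;        /* no results at all */

    if (probs[0]->probability < min_probability)
        return NULL;        /* best match is not confident enough */

    return probs[0]->category;
}
```

With real results, the same logic applies to the tw_prob_t array filled in by tw_classify() or tw_classify_file().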

EXAMPLE

For brevity, the following example assumes that Textweiser uses an SQLite database backend and that a set of categories has already been added and trained.

 #include <stdio.h>
 #include <stdlib.h>

 #include <tw.h>

 int main(int argc, char *argv[])
 {
     tw_errno_t   rv     = TW_OK;
     tw_prob_t  **probs  = NULL;
     const char  *string = "The house prices have risen.";
     tw_t         tw     = TW_INITIALIZER;

     /* Initialize a Textweiser object using the SQLite
      * database backend. */
     rv = tw_init(&tw, NULL, NULL, NULL, "example.sqlt", 0);

     if (rv != TW_OK)
     {
         fprintf(stderr, "Failed to initialize: %s\n",
                 tw_strerror(rv));

         return EXIT_FAILURE;
     }

     rv = tw_classify(&tw, string, 2, &probs);

     tw_free(&tw);

     if (rv == TW_OK)
     {
         if (probs)
         {
             short i = 0;

             for (i = 0; probs[i]; i++)
             {
                 printf("Category \"%s\" -> %.2f%%\n",
                        probs[i]->category, probs[i]->probability);
             }

             tw_free_prob_t(probs);
         }
         else
         {
             puts("No results");
         }

         return EXIT_SUCCESS;
     }
     else
     {
         fprintf(stderr, "Failed to classify: %s\n",
                 tw_strerror(rv));

         return EXIT_FAILURE;
     }

     return EXIT_SUCCESS;
 }

Example output:

 Category "Economy & Markets" -> 100.00%
 Category "Holidays" -> 13.02%

SEE ALSO

tw-classify(1)

tw_free_prob_t(3), tw_learn(3), tw_learn_file(3), tw_free(3), tw_strerror(3)

Textweiser User Manual

http://www.lingua-systems.com/text-classifier/textweiser-library/