Progress Programming Handbook


Using Word-break Tables

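You can create word-break tables that specify word separators using a rich set of criteria. Specifying and working with word-break tables involves the following tasks, each described in its own section below:

  • Specifying word delimiter attributes
  • Understanding the syntax of word-break tables
  • Compiling word-break tables
  • Associating compiled word-break tables with databases
  • Rebuilding word indexes
  • Providing access to the compiled word-break table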

Specifying Word Delimiter Attributes

As mentioned previously, to break down the contents of a word-indexed field into individual words, Progress needs to know which characters delimit words and which do not. The distinction can be subtle and sometimes depends on context. For example, consider the function of the dot in the character strings in Table 9–5.

Table 9–5: Is the Dot a Word Delimiter?

Character String | Function of the Dot | Is the Dot a Word Delimiter?
“Balance is $25,125.95” | Decimal separator | No
“Shipment not received.Call customs broker” | Period at end of sentence | Yes

In the first character string, the dot functions as a decimal point and does not divide one word from another. Thus, you can query on the word “$25,125.95.” In the second character string, by contrast, the dot functions as a period, dividing the word “received” from the word “call.”

To help define word delimiters systematically while allowing for contextual variation, Progress provides eight word delimiter attributes, which you can use in word-break tables. The eight word delimiter attributes appear in Table 9–6.

Table 9–6: Word Delimiter Attributes

Word Delimiter Attribute | Description | Default
LETTER | Always part of a word. | Assigned to all characters that the current attribute table defines as letters. In English, these are the uppercase characters A–Z and the lowercase characters a–z.
DIGIT | Always part of a word. | Assigned to the characters 0–9.
USE_IT | Always part of a word. | Assigned to the dollar sign ($), percent sign (%), number sign (#), at symbol (@), and underline (_).
BEFORE_LETTER | Treated as part of a word only if followed by a character with the LETTER attribute. Otherwise, treated as a word delimiter. | -
BEFORE_DIGIT | Treated as part of a word only if followed by a character with the DIGIT attribute. | Assigned to the period (.), comma (,), and hyphen (-). For example, “12.34” is one word, but “ab.cd” is two words.
BEFORE_LET_DIG | Treated as part of a word only if followed by a character with the LETTER or DIGIT attribute. | -
IGNORE | Ignored. | Assigned to the apostrophe ('). For example, “John's” is equivalent to “Johns.”
TERMINATOR | Word delimiter. | Assigned to all other characters.
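
The way the attributes in Table 9–6 combine can be illustrated with a small Python sketch. This is not Progress code and does not reproduce the Progress word-indexing engine; it only applies the default assignments and rules as the table states them, restricted to English letters. The names DEFAULT_ATTRS, REQUIRED_NEXT, attribute_of, and split_words are invented for this illustration.

```python
import string

# Default assignments from Table 9-6 (English letters only, for brevity).
# Any character not listed here is treated as TERMINATOR.
DEFAULT_ATTRS = {}
DEFAULT_ATTRS.update({ch: "LETTER" for ch in string.ascii_letters})
DEFAULT_ATTRS.update({ch: "DIGIT" for ch in string.digits})
DEFAULT_ATTRS.update({ch: "USE_IT" for ch in "$%#@_"})
DEFAULT_ATTRS.update({ch: "BEFORE_DIGIT" for ch in ".,-"})
DEFAULT_ATTRS["'"] = "IGNORE"

# For each contextual attribute, the attributes the NEXT character must
# have for the current character to stay part of the word.
REQUIRED_NEXT = {
    "BEFORE_LETTER": {"LETTER"},
    "BEFORE_DIGIT": {"DIGIT"},
    "BEFORE_LET_DIG": {"LETTER", "DIGIT"},
}


def attribute_of(ch, attrs):
    """Look up a character's attribute; unlisted characters are TERMINATOR."""
    return attrs.get(ch, "TERMINATOR")


def split_words(text, attrs=DEFAULT_ATTRS):
    """Split text into words according to the rules stated in Table 9-6."""
    words, current = [], []
    for i, ch in enumerate(text):
        attr = attribute_of(ch, attrs)
        if attr == "IGNORE":
            continue  # dropped without ending the current word
        keep = attr in ("LETTER", "DIGIT", "USE_IT")
        if attr in REQUIRED_NEXT:
            following = text[i + 1] if i + 1 < len(text) else ""
            keep = attribute_of(following, attrs) in REQUIRED_NEXT[attr]
        if keep:
            current.append(ch)
        elif current:
            words.append("".join(current))  # delimiter ends the word
            current = []
    if current:
        words.append("".join(current))
    return words


print(split_words("Balance is $25,125.95"))  # ['Balance', 'is', '$25,125.95']
print(split_words("ab.cd 12.34 John's"))     # ['ab', 'cd', '12.34', 'Johns']
```

The sketch looks only one character ahead and does not consider cases such as an ignored character directly after a BEFORE_DIGIT character; how Progress handles those cases is not covered here.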

Understanding the Syntax of Word-break Tables

Word delimiter attributes form the heart of word-break tables. You specify them using the following syntax:

SYNTAX
[ #define symbolic-name symbol-value ] ... 
[ Version = 9 
   Codepage = codepage-name 
   wordrules-name = wordrules-name 
   type = table-type 
] 
word_attr = 
{ 
  { char-literal | hex-value | decimal-value } , word-delimiter-attribute 
      [ , { char-literal | hex-value | decimal-value } 
           , word-delimiter-attribute ] ... 
}; 

symbolic-name

The name of a symbol.

For example: DOLLAR-SIGN

symbol-value

The value of the symbol.

For example: '$'

NOTE: Although some versions of Progress let you compile word-break tables that omit all items within the second pair of square brackets, Progress Software Corporation (PSC) recommends that you always include these items. If the source-code version of a compiled word-break table lacks these items, and the associated database is not so large as to make this unfeasible, PSC recommends that you add these items to the table, recompile the table, reassociate the table with the database, and rebuild the indexes.

codepage-name

The name, not surrounded by quotes, of the code page the word-break table is associated with. The maximum length is 20 characters.

For example: UTF-8

wordrules-name

The name, not surrounded by quotes, of the compiled word-break table. The maximum length is 20 characters.

For example: utf8sample

table-type

The number 2.

NOTE: Some versions of Progress allow a table type of 1. Although this is still supported, Progress Software Corporation (PSC) recommends, if feasible, that you change the table type to 2, recompile the word-break table, reassociate it with the database, and rebuild the indexes.

char-literal

A character within single quotes, or a symbolic-name, that represents a character in the code page.

For example: '#'

hex-value

A hexadecimal value, or a symbolic-name, that represents a character in the code page.

For example: 0xAC

decimal-value

A decimal value, or a symbolic-name, that represents a character in the code page.

For example: 39

word-delimiter-attribute

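The attribute that determines in what context the character is a word delimiter. You can use one of the eight attributes listed in Table 9–6: LETTER, DIGIT, USE_IT, BEFORE_LETTER, BEFORE_DIGIT, BEFORE_LET_DIG, IGNORE, or TERMINATOR.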

Examples of Word-break Tables

The following is an example of a word-break table for Unicode:

/* a word-break table for Unicode */ 
#define DOLLAR-SIGN '$' 
Version = 9 
Codepage = utf-8 
wordrules-name = utf8sample 
type = 2 
word_attr = 
{ 
  '.',         BEFORE_DIGIT, 
  ',',         BEFORE_DIGIT, 
  0x2D,        BEFORE_DIGIT, 
  39,          IGNORE, 
  DOLLAR-SIGN, USE_IT, 
  '%',         USE_IT, 
  '#',         USE_IT, 
  '@',         USE_IT, 
  '_',         USE_IT, 
}; 

As the preceding example illustrates, word-break tables can contain comments delimited as follows:

/* this is a comment */ 

For more examples, see the word-break tables that Progress provides in source-code form. They reside in the DLC/prolang/convmap directory and have the file extension .wbt.

NOTE: Progress supplies a word-break table for each code page it supports.
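
Because attribute names are easy to mistype (for example, a hyphen in place of an underscore), it can be useful to check a table's source before compiling it. The following Python sketch is a hypothetical helper, not part of Progress and not a substitute for the PROUTIL compiler: it only extracts the word_attr entries and reports any attribute that is not one of the eight names in Table 9–6. The function name find_unknown_attributes and the sample table are invented for this illustration.

```python
import re

# The eight attribute names listed in Table 9-6.
VALID_ATTRS = {
    "LETTER", "DIGIT", "USE_IT", "BEFORE_LETTER",
    "BEFORE_DIGIT", "BEFORE_LET_DIG", "IGNORE", "TERMINATOR",
}


def find_unknown_attributes(source):
    """Return (character, attribute) pairs whose attribute name is unknown."""
    # Remove /* ... */ comments.
    text = re.sub(r"/\*.*?\*/", "", source, flags=re.S)
    # Take everything between "word_attr = {" and the closing "};".
    match = re.search(r"word_attr\s*=\s*\{(.*?)\}\s*;", text, flags=re.S)
    if match is None:
        raise ValueError("no word_attr block found")
    # A token is either a quoted character such as ',' or a bare item
    # (hex value, decimal value, symbolic name, or attribute name).
    tokens = re.findall(r"'.'|[^\s,']+", match.group(1))
    if len(tokens) % 2 != 0:
        raise ValueError("entries must be character/attribute pairs")
    pairs = zip(tokens[0::2], tokens[1::2])
    return [(char, attr) for char, attr in pairs if attr not in VALID_ATTRS]


# Hypothetical table source with one deliberately misspelled attribute.
SAMPLE = """
/* sample table for the checker */
#define DOLLAR-SIGN '$'
Version = 9
Codepage = utf-8
wordrules-name = utf8sample
type = 2
word_attr =
{
  '.',         BEFORE_DIGIT,
  ',',         BEFORE-DIGIT,
  39,          IGNORE,
  DOLLAR-SIGN, USE_IT
};
"""

print(find_unknown_attributes(SAMPLE))  # [("','", 'BEFORE-DIGIT')]
```

The check is purely textual. It does not validate the header items, resolve #define symbols, or confirm that a character value exists in the code page.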

Compiling Word-break Tables

After you create or modify a word-break table, you must compile it with the PROUTIL utility. The syntax is as follows:

Operating System | Syntax
UNIX, Windows | proutil -C wbreak-compiler src-file rule-num

src-file

The name of the word-break table file to be compiled.

rule-num

A number between 1 and 255 inclusive that identifies this word-break table within your Progress installation.

The PROUTIL utility names the compiled version of the word-break table proword.rule-num. For example, if rule-num is 34, PROUTIL names the compiled version proword.34.
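
For scripted builds, the compile step can be assembled programmatically. The following Python sketch only builds the argument list from the syntax shown above and derives the expected output name; it does not run PROUTIL. The function names and the file name mytable.wbt are hypothetical.

```python
def wbreak_compile_command(src_file, rule_num):
    """Build the PROUTIL argument list for compiling a word-break table."""
    if not 1 <= rule_num <= 255:
        raise ValueError("rule-num must be between 1 and 255 inclusive")
    return ["proutil", "-C", "wbreak-compiler", src_file, str(rule_num)]


def compiled_table_name(rule_num):
    """Name that PROUTIL gives the compiled table for this rule number."""
    if not 1 <= rule_num <= 255:
        raise ValueError("rule-num must be between 1 and 255 inclusive")
    return f"proword.{rule_num}"


print(wbreak_compile_command("mytable.wbt", 34))
# ['proutil', '-C', 'wbreak-compiler', 'mytable.wbt', '34']
print(compiled_table_name(34))  # proword.34
```

The resulting list could be passed to subprocess.run on a machine where PROUTIL is installed and on the path.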

Associating Compiled Word-break Tables with Databases

After you compile a word-break table, you must associate the compiled version with a database using the PROUTIL utility. The syntax is as follows:

Operating System | Syntax
UNIX, Windows | proutil database -C word-rules rule-num

database

The name of the database.

rule-num

The value of rule-num that you specified when you compiled the word-break table.

To associate the database with the default word-break rules, set rule-num to zero.

NOTE: Setting rule-num to zero associates the database with the default word-break rules for the current code page. For more information on code pages, see the Progress Internationalization Guide.
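
The association step can be sketched the same way. This Python helper only assembles the argument list from the syntax above, accepting zero as the value that selects the default rules; it does not invoke PROUTIL. The function name and the database name mydb are hypothetical.

```python
def word_rules_command(database, rule_num=0):
    """Build the PROUTIL argument list that associates a table with a database.

    rule_num 0 selects the default word-break rules; 1-255 select a
    compiled table with that rule number.
    """
    if not 0 <= rule_num <= 255:
        raise ValueError("rule-num must be 0 (default rules) or 1-255")
    return ["proutil", database, "-C", "word-rules", str(rule_num)]


print(word_rules_command("mydb", 34))
# ['proutil', 'mydb', '-C', 'word-rules', '34']
print(word_rules_command("mydb"))
# ['proutil', 'mydb', '-C', 'word-rules', '0']
```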

Rebuilding Word Indexes

For word indexing to work as expected, Progress must use identical word-break tables to write word indexes (when it adds, modifies, or deletes a record that contains a word index) and to read them (when it processes a query that contains the CONTAINS operator). To ensure this, when you associate the compiled version of a word-break table with a database, Progress writes cyclic redundancy check (CRC) values from the compiled word-break table into the database. When you connect to the database, Progress compares the CRC values in the database with the CRC values in the compiled version of the word-break table. If they do not match, Progress displays an error message and terminates the connection attempt.

If a connection attempt fails and you want to avoid rebuilding the indexes, you can try associating the database with the default word-break rules.

NOTE: This might invalidate the word indexes and require you to rebuild them anyway.

To rebuild the indexes, you can use the PROUTIL utility with the IDXBUILD or IDXFIX qualifier.

The syntax of PROUTIL with the IDXBUILD qualifier is:

Operating System | Syntax
UNIX, Windows | proutil db-name -C idxbuild [ all ] [ -T dir-name ] [ -TB blocksize ] [ -TM n ] [ -B n ]

The syntax of PROUTIL with the IDXFIX qualifier is:

Operating System | Syntax
UNIX, Windows | proutil db-name -C idxfix

For more information on the PROUTIL utility, see the Progress Database Administration Guide and Reference.

Providing Access to the Compiled Word-break Table

To allow database servers and shared-memory clients to access the compiled version of the word-break table, the compiled file must reside either in the Progress installation directory or in the location pointed to by the environment variable PROWDrule-num (that is, PROWD followed by the rule number). For example, if the compiled word-break table has the name proword.34 and resides in the DLC/mydir/mysubdir directory, set the environment variable PROWD34 to DLC/mydir/mysubdir/proword.34.

NOTE: Although the name of the compiled version of the word-break table has a dot, the name of the corresponding environment variable does not.
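
The naming rule for the environment variable can be expressed as a short Python sketch. It derives the variable name and value from a rule number and a directory, following the example above. The function name prowd_setting is hypothetical, and posixpath is used only so that the result uses forward slashes regardless of platform.

```python
import posixpath


def prowd_setting(rule_num, directory):
    """Return the (name, value) pair for the PROWD environment variable."""
    if not 1 <= rule_num <= 255:
        raise ValueError("rule-num must be between 1 and 255 inclusive")
    name = f"PROWD{rule_num}"  # no dot, unlike the file name
    value = posixpath.join(directory, f"proword.{rule_num}")
    return name, value


name, value = prowd_setting(34, "DLC/mydir/mysubdir")
print(f"{name}={value}")  # PROWD34=DLC/mydir/mysubdir/proword.34
```

The variable must be set in the environment of the server or client process itself, for example in the shell or startup script that launches it.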


Copyright © 2004 Progress Software Corporation