addedValues Plugin

...Powerful Free! Database Expansion for Manila

Text Indexing

A variable that is indexed, and for which Index by Word is true, is said to be text indexed. The value of the variable is broken into words and each word is indexed separately. For example, in the preceding sentence, the words are

Defining a Word

The definition of a word is determined by a regular expression, which can be set in the preferences for each site. Regular expressions are well documented, including the implementation used by Frontier prior to the release of Frontier 9.1, at Script Meridian. A new implementation of regex was added to Frontier 9.1.

One possible definition of a word is \w+ , which matches any run of one or more adjacent letters or digits (most regex implementations also count the underscore). A better definition is [A-Za-z][a-zA-Z0-9\-]*[a-zA-Z0-9] which rejects leading digits, allows hyphens inside a word but not at the end, and enforces a minimum length of 2 characters. A further level of sophistication is to disallow trailing s characters using [A-Za-z][a-zA-Z0-9-]*[a-rt-zA-RT-Z0-9]. In our example sentence this generates this list

For European languages with Latin character sets, a variant that handles diacritical characters such as ¸, Â and others correctly is [[:alpha:]][[:alnum:]-]*[[:alnum:]]; however, it doesn't seem to be possible to strip trailing s characters with it. It is language idiosyncrasies like this that make it desirable to allow the Managing Editors of each site to individually change the definition of a word for their sites.
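The three ASCII patterns discussed in this section can be compared side by side. The following is only a sketch using Python's `re` module, not Frontier's regex engine, so details of matching may differ. The sample sentence and the names `SIMPLE`, `STRICT`, `NO_TRAILING_S` and `split_words` are made up for illustration, and the third pattern is written with the ranges a-r and t-z so that only s is excluded. The POSIX-class variant is left out because Python's standard `re` module does not implement `[[:alpha:]]`-style classes.

```python
import re

# The three word definitions discussed above (names are illustrative only).
SIMPLE = re.compile(r"\w+")
STRICT = re.compile(r"[A-Za-z][a-zA-Z0-9\-]*[a-zA-Z0-9]")
NO_TRAILING_S = re.compile(r"[A-Za-z][a-zA-Z0-9-]*[a-rt-zA-RT-Z0-9]")

# Hypothetical sample text, chosen to exercise digits, hyphens and plurals.
SAMPLE = "The 2 state-of-the-art variables break a value into words"


def split_words(pattern, text):
    """Return every non-overlapping match of the word pattern, in order."""
    return pattern.findall(text)


# \w+ keeps the lone digit and the single letter, and splits on hyphens.
print(split_words(SIMPLE, SAMPLE))
# The stricter pattern drops "2" and "a" and keeps the hyphenated word whole.
print(split_words(STRICT, SAMPLE))
# The last pattern backtracks past a final "s": "variables" -> "variable".
print(split_words(NO_TRAILING_S, SAMPLE))
```

Note that the trailing-s pattern is a blunt instrument: it trims any final s, so a word such as "class" would be indexed as "cla".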

Stopwords

Bearing in mind that the point of indexing is to perform searches, it is common to strip "noise" words that contribute nothing to a search but do increase the number of indexed words and hence slow the search down. In addedValues these are called stopwords and addedValues ships with a set for English, German, French, Italian, and Dutch. Using the default stopwords list for English reduces the words from our example sentence to

The server administrator can install a set of stopwords that apply to all addedValues text indexing operations for any language, overriding addedValues' default list. In addition, the Managing Editor of each site can specify a list that applies only to their website, overriding the default and the server administrator's list. Both of these override operations need to be performed manually in addedValues 1.0.
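The override order described above (site list first, then server list, then the shipped default) can be sketched as follows. This is an illustration in Python, not addedValues code. The function names, the tiny English stopword set, lower-casing before comparison, and treating an override as a full replacement of the list for that language are all assumptions.

```python
import re

# Word definition taken from the "Defining a Word" section.
WORD_PATTERN = re.compile(r"[A-Za-z][a-zA-Z0-9\-]*[a-zA-Z0-9]")

# Hypothetical, deliberately tiny default table; the real lists are much longer.
DEFAULT_STOPWORDS = {"english": {"the", "of", "is", "and", "into", "each"}}


def effective_stopwords(language, default, server=None, site=None):
    """Pick the stopword set for a language: site beats server beats default."""
    for table in (site, server, default):
        if table is not None and language in table:
            return table[language]
    return set()


def index_words(text, stopwords):
    """Split text into words and drop stopwords (case-folding is an assumption)."""
    words = [w.lower() for w in WORD_PATTERN.findall(text)]
    return [w for w in words if w not in stopwords]


SENTENCE = ("The value of the variable is broken into words "
            "and each word is indexed separately")

stop = effective_stopwords("english", DEFAULT_STOPWORDS)
print(index_words(SENTENCE, stop))

# A site-level list (hypothetical) replaces the default for that site only.
site_table = {"english": {"the", "value"}}
site_stop = effective_stopwords("english", DEFAULT_STOPWORDS, site=site_table)
print(index_words(SENTENCE, site_stop))
```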

b162 - indexing stopwords are now set by language; default tables have been provided for English, German (thanks to Peter Baumgartner for locating a list and to Hans Lohninger for creating it), French (thanks to Jerome Camus), Italian (Jerome Camus!), Dutch (Jan Storms) and Danish (from apseek.org). The defaults can be overridden for each and any language at the server and website level.

b167 - the containsAny and containsAll operators, which compare the value of a variable to a string, have been implemented. They can only be used with variables that have been indexed by word. The words tested are those obtained by breaking the string into distinct index-eligible words, using the same word pattern definition and stopwords as indexing. containsAny returns true if the value contains any of the words; containsAll returns true if they all appear in the value at least once.

- a search of a variable which is indexed by word now does a containsAny search when the search item is not a single word.
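The behaviour of the two operators can be sketched in Python as below. This is not the addedValues implementation. The function names, the small stopword set and the lower-casing are assumptions, and the source does not say what containsAll returns when the string yields no eligible words, so this sketch simply returns true in that case.

```python
import re

# Same word definition as used for indexing (see "Defining a Word").
WORD_PATTERN = re.compile(r"[A-Za-z][a-zA-Z0-9\-]*[a-zA-Z0-9]")

# Hypothetical stopword set, for illustration only.
STOPWORDS = {"the", "of", "is", "and", "up"}


def eligible_words(text):
    """Distinct index-eligible words: pattern matches minus stopwords."""
    return {w.lower() for w in WORD_PATTERN.findall(text)} - STOPWORDS


def contains_any(value, string):
    """True if at least one eligible word of string occurs in value."""
    return bool(eligible_words(value) & eligible_words(string))


def contains_all(value, string):
    """True if every eligible word of string occurs in value.

    Assumption: an empty word list counts as trivially satisfied.
    """
    return eligible_words(string) <= eligible_words(value)


value = "Jack and Jill went up the hill"
print(contains_any(value, "Jill Humpty"))        # one of the two words is present
print(contains_all(value, "Jill Humpty"))        # "humpty" is missing
print(contains_all(value, "the hill and Jack"))  # stopwords are ignored
```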

fc12 - when a search form includes an element for a variable which is indexed by word, it is permitted to submit multiple-word search targets. Until now, addedValues treated that as an OR search. That is, a search target of "Jack Jill" (not including the quotes) returned the union of the hits from searching for each term separately. Now, the default is an intersection (AND). However, to search for the union of hits, use the search target "Jack OR Jill" - the OR must be upper case.
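The AND-by-default rule with an explicit upper-case OR can be sketched in Python as follows. This is an illustration only: the index layout (a word mapped to a set of record ids), the `search` function and the lower-casing of terms are hypothetical, and the handling of targets that mix both styles (each run of words between ORs is intersected, then the runs are unioned) is an assumption the changelog does not confirm.

```python
def search(index, target):
    """Return the ids matching a multi-word target.

    Words separated by spaces are ANDed; an upper-case OR starts a new group
    and the groups are unioned. Only the exact token "OR" is treated as the
    operator; any other spelling is looked up as an ordinary word.
    """
    groups, current = [], []
    for token in target.split():
        if token == "OR":
            if current:
                groups.append(current)
            current = []
        else:
            current.append(token.lower())
    if current:
        groups.append(current)

    hits = set()
    for group in groups:
        # Intersection of the hit sets of every word in the group.
        hits |= set.intersection(*(index.get(word, set()) for word in group))
    return hits


# Hypothetical index: word -> ids of the records whose value contains it.
index = {"jack": {1, 2}, "jill": {2, 3}}

print(search(index, "Jack Jill"))     # intersection of the two hit sets
print(search(index, "Jack OR Jill"))  # union of the two hit sets
```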