Tuesday, January 29, 2008

For NCBI

Your searches use:
-mainly nouns and adjectives, as keywords;
-conjunctions, as formal logical operators.

The parts of speech in English show a distribution that becomes evident when you put one of the freely available web dictionaries into an Excel spreadsheet, as in the attached file.
The most numerous words are “carriers of sense” (sémantèmes). They exist because of either the concepts of substance they correspond to (nouns, adjectives) or the concepts of action they correspond to (verbs).
A certain number of words exist because the sémantèmes need to be put in relation. As a consequence, they express certain types of relation, valid no matter what the sémantèmes are.
When sémantèmes undergo changes to allow relations with other words, they are subjected to “flexion”.
Dictionaries take into account only the “basal” form of a word: nominative for nouns, infinitive for verbs.
“Flexion” is the changed form under which a word appears; its job is to make specific the relation of a given word to the other words in a phrase, according to their sense.
“Highly flexionary languages” like Sanskrit, German, Finnish, Estonian, Hungarian and Italian (to cite only the classical ones and those modern ones having importance beyond their national borders) are in fact languages with a dominant enclitic flexion. Including all the words resulting from flexion in a dictionary would generate a huge list.
This is no longer the case with many modern languages, including English, which have “delegated” most of the sense of enclitic flexion to auxiliaries, mainly prepositions and certain adjectives and conjunctions. Strangely, a term like “extraclitic” (or “exoclitic”?) flexion hasn’t been adopted!
Flexion, whether enclitic or “exoclitic”, carries the logic of sense, the “semantic logic”. It is polyvalent.
But a part of the language seems to apply an “existential judgment” to groups of words, whether or not it takes their semantic value into account. This existential judgment uses what could be named a “metalanguage” composed of “to exist”, “yes”, “no”, “and”, “or”, “either/or”. These are known as “logical operators”.
Search engines use this existential metalanguage. They retrieve items “existentially”: criteria exist together or exclude each other. They do not look for items where search criteria are related on semantic bases. Hence, they don’t use flexion.
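As a rough illustration of such “existential” retrieval, here is a minimal Python sketch. It is not how any particular search engine is implemented; the function names and the two sample sentences are invented for the example. It only checks whether every search term exists somewhere in a text.

```python
import re

def words(text):
    """Return the set of lowercased word tokens found in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def matches_all(text, terms):
    """'Existential' AND: every term occurs somewhere in the text."""
    return {t.lower() for t in terms} <= words(text)

# Two hypothetical one-sentence "documents" (invented for illustration).
doc_a = "RNA is exported from the nucleus and accumulates in the cytoplasm."
doc_b = "The cytoplasm was stained; RNA was then extracted from the nucleus."

print(matches_all(doc_a, ["rna", "cytoplasm"]))  # True
print(matches_all(doc_b, ["rna", "cytoplasm"]))  # True
```

Both sentences match, although only the first one says anything about RNA being in the cytoplasm: the criteria merely exist together.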
In English such relations do not need enclitic flexion (which would greatly increase the number of words in a dictionary) because the flexion is “exoclitic”. So a search engine could achieve an enormous “semantic gain” if it accepted at least the most common words of the exoclitic flexion, namely prepositions: 1) in, inside, within, for the Latin locative; 2) from, for the Latin ablative, and to, for the Latin dative; 3) of the, “ ’ ”, “’s”, for the genitive.
A fairly exhaustive list of the prepositional correspondences of the cases could probably be found in a Finnish dictionary. But I doubt that all of them are of much use in medicine and biology. The same seems to be true of adverbs. They would probably be useful for general-purpose search engines (assuming these stop giving tens of millions of answers in less than a second and stop giving priority to advertising sites).
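One way to picture this proposal is a small lookup table from prepositions to the cases they stand in for, plus a query splitter that keeps the preposition instead of discarding it. The Python sketch below is only that: the names `PREPOSITION_CASE` and `parse_query` are hypothetical, the table covers just the prepositions listed above, and the apostrophe forms of the genitive are left out for simplicity.

```python
# Hypothetical table: English preposition -> the Latin case it replaces.
PREPOSITION_CASE = {
    "in": "locative",
    "inside": "locative",
    "within": "locative",
    "from": "ablative",
    "to": "dative",
    "of": "genitive",
}

def parse_query(query):
    """Split a query at the first known preposition.

    Returns (left_terms, relation, right_terms), where relation is a
    (preposition, case) pair, or None when no preposition links two terms.
    """
    tokens = query.lower().split()
    for i, tok in enumerate(tokens):
        # The preposition must stand between two terms to express a relation.
        if tok in PREPOSITION_CASE and 0 < i < len(tokens) - 1:
            return tokens[:i], (tok, PREPOSITION_CASE[tok]), tokens[i + 1:]
    return tokens, None, []

print(parse_query("rna IN cytoplasm"))
# (['rna'], ('in', 'locative'), ['cytoplasm'])
print(parse_query("rna cytoplasm"))
# (['rna', 'cytoplasm'], None, [])
```

With the relation preserved, the engine at least knows that the user asked for a locative relation between the two terms, rather than for their mere coexistence.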
And, of course: if search engines (medical or not) worked in German or Finnish or…








Natural languages use semantic operators too.
Prepositions are the most important among them, let's say because they assume the role that cases play in languages like Latin, Sanskrit, German, Finnish, Estonian, Hungarian and Italian (to cite only the classical ones and those modern ones having importance beyond their national borders).


Searches in a database like yours would benefit from introducing them, at least partially. For a start, I would consider the following:
-in, within, inside;
-from (for equivalence with the ablative in Latin);
-to (for equivalence with the dative in Latin);
-' and 's (for the genitive).

As a quick example of their use, one has merely to compare the results of searches such as (rna cytoplasm) and (rna IN cytoplasm). With the latter, you get an answer that corresponds more closely to your interest.

Another example: (rna “outside the nucleus”) finds results where only “outside” appears, provided that “nucleus” exists somewhere in the page, even without any relation to “outside” (5512). Astonishingly, (rna “outside” “the nucleus”) finds far fewer results (2), in which “outside” and “nucleus” both appear, separately.
These results seem normal if only formal logic is used.
But these results are disappointing if you look for something precise: articles speaking about rna in the nucleus!
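To make the contrast concrete, here is a small Python sketch that compares plain co-occurrence with a crude “relation-aware” match. It is only a toy heuristic under stated assumptions, not a description of how any real engine works: the functions `cooccur` and `related` and the sample texts are invented, and the relation test simply requires the preposition to be followed by the second term (with an optional article) in the same sentence, after the first term.

```python
import re

ARTICLES = {"the", "a", "an"}

def tokens(text):
    """Lowercased word tokens, in order of appearance."""
    return re.findall(r"[a-z]+", text.lower())

def cooccur(text, terms):
    """Formal logic only: all terms exist somewhere in the text."""
    found = set(tokens(text))
    return all(t in found for t in terms)

def related(text, left, prep, right):
    """True if some sentence contains `left` and, after it,
    `prep` followed by `right` (an article may stand in between)."""
    for sentence in re.split(r"[.;!?]", text):
        toks = tokens(sentence)
        if left not in toks:
            continue
        for i in range(toks.index(left) + 1, len(toks) - 1):
            if toks[i] != prep:
                continue
            j = i + 2 if toks[i + 1] in ARTICLES else i + 1
            if j < len(toks) and toks[j] == right:
                return True
    return False

# Hypothetical sample texts (invented for illustration).
doc_a = "Most of the RNA was found outside the nucleus."
doc_b = "Outside the laboratory, samples degrade. RNA stays in the nucleus."

for doc in (doc_a, doc_b):
    print(cooccur(doc, ["rna", "outside", "nucleus"]),
          related(doc, "rna", "outside", "nucleus"))
# True True
# True False
```

Both texts satisfy the formal-logic test, but only the first one relates “outside” to “nucleus”, which is the kind of distinction a preposition-aware search could make.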
