Mathieu Tozer's Dev Blog

Cocoa, the development of Words, and other software projects (including those dang assessment tasks).




Parsing Issues










I wrote all this today while at work at MegaSun. I tapped it out on my Newton; the screen wasn't so readable, and I haven't proofed it yet. So it might be all crap. So there you go!


--


The dictionary lookup utility must be versatile enough to look up words in any dictionary, in any language. The dictionary-choosing utility must interface with it neatly, so that it can handle the steady stream of words coming in and (perhaps) advise the lookup utility which language [encoding] each word is in, so that it searches the right dictionary.


Complexity generally increases towards the dictionary lookup facility, then decreases afterwards as the word gets through to the actual application and is inserted in the appropriate list under its language.


What if the development of some part of this could be done in C++ and factored out into one of my uni assignments, so that I can minimise the amount of work being done? I don't want to have to work on some dumb project unrelated to my Words project while trying to get that done as well.


It is difficult to assess the actual amount of work each component amounts to. Various unforeseen obstacles will most likely stand in the way, things that didn't rear their ugly heads in the specs.


If I can encapsulate the various parsing modules, develop them, and put them all together into a working unit, where the OUTPUT of the last module interfaces with Words, then great. This removes some of the complexity from the main app (though it is still used by it).


So these are the steps (sketched roughly in code below).
1. Initial formatting (make plain text etc., correct encoding)
2. Separation of texts into words (with their context saved)
3. Checking for prior existence in the user dictionary (warning: synonym killer)
4. Determining encoding, and therefore language, and attaching it to the word (if not done in the first step)
5. Passing the word to the lookup utility, which:
5.1 Checks the active language dictionaries, then the other language dictionaries, for a hit. The word is output with dictionary data attached.
6. The word is input into the application proper and other metadata (date, knowledge level, the list and language the word belongs to) is added.
7. The application adds new words to the user dictionary for that language. (Could the same lookup utility be used for looking up user dictionaries?) = Less code! And less code is always better code.
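
As a rough sketch (in Swift, say, though I haven't settled on a language; every name here is mine and nothing is decided), the whole pipe might hang together like this:

struct Word {
    var text: String
    var language: String? = nil       // step 4: guessed from encoding/script
    var translations: [String] = []   // step 5: dictionary data attached
    var context: String? = nil        // step 2: the sentence it came from
    var isNew = true                  // step 3: not yet in the user dictionary
}

// Stubs standing in for the real modules described in the steps above.
func format(_ raw: String) -> String { raw.lowercased() }            // step 1
func separate(_ text: String) -> [Word] {                            // step 2
    text.split(whereSeparator: { $0.isWhitespace }).map { Word(text: String($0)) }
}
func seenBefore(_ word: Word) -> Bool { false }                      // step 3
func tagLanguage(_ word: Word) -> Word { word }                      // step 4
func lookUp(_ word: Word) -> Word { word }                           // step 5

func parse(_ raw: String) -> [Word] {
    separate(format(raw))
        .filter { !seenBefore($0) }   // chuck already-seen words (synonym caveat!)
        .map(tagLanguage)
        .map(lookUp)
    // steps 6 and 7 happen in the application proper
}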


Then we will have our goal: lists of words that you have and haven't seen, displayed in user-defined smart lists with their dictionary attributes showing, and all the user had to do was drag in a block of text.


The above is what would happen in a 'perfect' world. There are some snags along the way, like synonyms getting chucked in the bin because they appear to have been seen before. This could be overcome by dictionaries that flag words that have synonyms.


There's also the issue of some words having multiple meanings, so the dictionary chucks words out the other end with more than one translation attached. The best solution I have for this so far is to have a clarifier ask the user which meaning they want. In practice, I hope, users won't have THAT many new words to learn, so they won't be swamped with long hours of classifying words into what they mean. That would be the direct opposite of what this program is designed to do. But if it were something like 10 words in a bunch that needed clarification (synonyms etc.), that could work.
The option for words that go un-clarified (say the user never bothers) is to leave all translations attached to the word.
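
Building on the Word sketch above, the clarifier idea amounts to something this small (again just a sketch, and 'clarify' is my own hypothetical name):

extension Word {
    var needsClarification: Bool { translations.count > 1 }

    // The user picks one meaning; un-clarified words keep all translations.
    mutating func clarify(choosing index: Int) {
        guard translations.indices.contains(index) else { return }
        translations = [translations[index]]
    }
}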


Think: what would users (I almost said 'I') be doing manually?
When they encounter a word in a dictionary and see many variations, do they choose just the one and chuck out the rest for a later time, or do they try to absorb them all, so that when they encounter the word in the real world they have a better chance of understanding it in context? Something to find out is whether dictionaries give a 'most common' usage of a word, so that it can be displayed most prominently to minimise clutter; with further disclosure a word could then reveal its other, similar meanings.


THUS: adhere to simplicity, and reveal complexity when it is needed, which is some of the time. Otherwise, show only the basics, so that (as per the Google homepage) what you want to get done most of the time is shown most of the time, without clutter.


A closed Swiss Army knife.



I think this strategy might work: expand a row containing a word to reveal more detailed information about it.


The other issue is determining which language a word (or set of words) belongs to. Generally an input text is all in a single language. No, can that: word lists given by teachers sometimes have the translations put in them already. (Note: users should be able to change the autolookup translation to anything their teacher tells them it is; we don't want students losing marks because they called something semantically the same but not what the teacher ordered.) And some texts or webpages might contain various languages. Therefore language (if it is determinable at all) should be determined on a word-by-word basis. I hope the added processing doesn't kill the speed of proceedings too harshly!
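
For what per-word determination might look like: if the encoding information gives us nothing better, a crude first guess could key off Unicode script ranges. A sketch, nothing more (the ranges and labels here are simplifications, and a script is not a language):

func guessLanguage(of word: String) -> String? {
    guard let first = word.unicodeScalars.first else { return nil }
    switch first.value {
    case 0x3040...0x309F: return "Japanese (hiragana)"
    case 0x30A0...0x30FF: return "Japanese (katakana)"
    case 0x4E00...0x9FFF: return "CJK ideograph"      // Chinese or Japanese
    case 0x0400...0x04FF: return "Cyrillic"
    case 0x0000...0x024F: return "Latin script"       // could be many languages
    default:              return nil                  // leave undetermined
    }
}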


There is also the issue of keeping the textual context with the word. Does the parser go back once the word has made it through the other stages and grab the sentence the word came from, by searching for the first instance of it? Or does the word object carry its context through the whole system from the word go? The latter seems complex, and makes the system carry a lot more baggage than is really necessary. The former method means the text will need to be kept around just long enough to get all the words that made it through aligned with their context text, and then put away neatly.


There is also a need for the user dictionary to record how many 'hits' a word gets, so that the importance or regularity of a word can be monitored and reflected in the application. This should be a simple matter of incrementing an integer each time a word is queried.
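
In sketch form, that really is all it is (hitCounts here standing in for the real user-dictionary store):

var hitCounts: [String: Int] = [:]

// Bump the count every time a word is queried.
func recordQuery(for word: String) {
    hitCounts[word, default: 0] += 1
}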


So there I have several modules that all fall under the header of 'parser', which is essentially the crux of the main functionality of the program (other functions include flashcarding, iPod export and recording, networked features etc.).
Therefore this is the section that I need to start on first.


I don't need to decide yet what language I'm going to implement in; that is a minor consideration compared to the overall architecture. And I needn't worry about what happens after the neatly packaged 'new word' object is passed to the application proper.



Let's just say that computer science at Monash University doesn't do much for your programming confidence if you're not very good at writing command-line disk drivers and semaphore handlers in procedural C. No matter how good a designer or OO programmer you are, you're not going to get very good marks. Your design and algorithms should just slip into code when they are ready to. Don't force it; just keep on designing.



So, to design the first phase of the parser: the Formatter.


Ok, back to the formatter.
Input: text as plain, RTF, or HTML etc., from 1 to x words long.
It only works if separate words are separated by whitespace; therefore Asian scripts, as they are normally written, are not supported at this stage unless they are separated by spaces, for example. It will of course work fine for single words.


Output: a plain text object to be parsed, with encoding information intact.


Inside, the text is made plain. All HTML is removed. Images are removed. Basically the text has to be stripped of all the junk. Excess punctuation is removed where it is not part of a word. Factor all these filters out into different functions so that errors in the code are more easily identified, or so that when something like the punctuation removal is not working properly I know exactly where to look.
Make everything lower case. Put the text into a special object and pass it on as the output.
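
A first stab at the formatter might look like this (assuming plain-text and HTML input only; the tag-stripping regex is the crudest thing that could possibly work, and a real conversion would lean on an existing framework):

import Foundation

func makePlain(_ input: String) -> String {
    var text = input
    // Each filter on its own line, so a misbehaving one is easy to find.
    text = text.replacingOccurrences(of: "<[^>]+>", with: " ",
                                     options: .regularExpression)   // strip HTML tags
    text = text.replacingOccurrences(of: "[^\\p{L}\\p{M}\\s'-]", with: " ",
                                     options: .regularExpression)   // strip junk punctuation
    return text.lowercased()
}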


The second step of the journey is a small handler that dishes out words consecutively.
It could actually encase the dictionary lookup utility (step 3), in the sense that it calls it to look up a word on every iteration, passing a single word to it. It takes as input a formatted text object and outputs individual words as strings with their encoding details attached (if these are available to be used to identify languages).
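
The handler itself would be tiny; something like this (lookUp being a placeholder for whatever the utility's entry point turns out to be):

// Takes the formatted text and feeds the lookup utility one word per iteration.
func dishOutWords(from formatted: String, to lookUp: (String) -> Void) {
    for word in formatted.split(whereSeparator: { $0.isWhitespace }) {
        lookUp(String(word))   // encoding details would travel along with the word
    }
}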


The third step is the lookup utility.
It takes as input an individual word, and it has access to the various user and non-user language dictionaries in the system.
At this step it is only concerned with the user dictionaries. When a simple word object (a wrapped string with encoding info attached?) is sent to it, it only ever checks the user dictionary (although this might upset the general nature of the utility, the only need I see for it at this stage is for the parsing). We'll roll with it.
When called at this stage and passed only a single word, the lookup looks in the appropriate language's user dictionary and sees if an entry already exists. If there is an identical entry, and the entry does not have a synonym, then the word is chucked (i.e. we've already seen the word before). The entry does get updated with a new 'times seen' increment. If there is no prior entry, the word object is output with 'new' flagged. So the lookup utility OUTPUTS a new word object of class word, at this stage flagged with the fact that it has not been seen before.
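
The decision itself might boil down to this (Entry and its synonym flag are hypothetical names of mine; the point is the chuck/keep decision and the 'times seen' increment):

struct Entry {
    var timesSeen = 1
    var hasSynonym = false
}

var userDictionary: [String: Entry] = [:]

// Returns true if the word is new and should continue down the pipe.
func checkUserDictionary(_ word: String) -> Bool {
    if var entry = userDictionary[word], !entry.hasSynonym {
        entry.timesSeen += 1           // already seen: just bump the count
        userDictionary[word] = entry
        return false                   // chuck it
    }
    return true                        // no prior entry (or a synonym): flag as new
}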


The fourth step for a word is to be INPUT into the lookup utility again, but this time the lookup utility searches the language dictionaries (see the note about language selection) for instances of the word. The procedures for this have been documented elsewhere, but the end result is that the lookup utility then OUTPUTS the word object with all the search results obtained in the lookup entered into the object.


(Now I am thinking that the two lookup utilities described would be better as two separate code bases, as the second one has to do considerably more. Although there are some similarities... perhaps closer examination of exactly what data the entities will manipulate will make this distinction clearer and help in deciding.)
The next step involves the contextualiser, which takes as INPUT the original text (I mean the very original, not the formatted one) and scans it for each new word. The sentence the word occurs in (from the last full stop to the following full stop, perhaps) is copied into the object and OUTPUT to the next stage.
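
In sketch form (treating '.' as the only sentence boundary, which is obviously too naive for real text):

import Foundation

// Scan the very original text for the first occurrence of a new word
// and copy out the enclosing sentence.
func sentenceContext(for word: String, in original: String) -> String? {
    guard let hit = original.range(of: word, options: .caseInsensitive) else {
        return nil
    }
    let start = original[..<hit.lowerBound].lastIndex(of: ".")
        .map(original.index(after:)) ?? original.startIndex
    let end = original[hit.upperBound...].firstIndex(of: ".")
        .map(original.index(after:)) ?? original.endIndex
    return String(original[start..<end])
        .trimmingCharacters(in: .whitespacesAndNewlines)
}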


The next step is to use the insert function on the appropriate user dictionaries to INSERT all the new word objects into the user dictionaries.


Now we have a pool of objects (although this all really happens in a streamlined manner, without objects really waiting anywhere) to be inserted into the application sphere, which is not discussed here in detail. But I will say that they are set and ready to have other application-level (i.e. non word-specific) metadata associated with them, such as the time and date, their initial knowledge level (nothing) etc., and also which list they belong to.


The next steps involve researching which dictionaries I am going to be interfacing with, and what kind of results I can expect to have returned from them when given search criteria. What kind of search criteria do they require?
Then it might be a good idea to see what frameworks are already available that I can use, so that I don't have to invent them myself, like the input of various files and their conversion to plain text.




