Wo3 Hen3 Chan2

Wo3 Hen3 Chan2: Concept

I'm picturing something along the following lines:

As you type or paste into the text field, the area below stays up to date with the results most relevant to what you entered. The text field displays its data in a comprehensible format; for example, GB2312 data pasted directly into the text field (and interpreted as such) should be displayed as Chinese glyphs.
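The display requirement could be sketched roughly as follows. This assumes the pasted data arrives as raw bytes and that the Java runtime ships the GB2312 charset (it is part of the extended charsets, so it is not guaranteed on every runtime). The class and method names are hypothetical.

```java
import java.nio.charset.Charset;

/** Hypothetical helper: turns raw pasted bytes into displayable text. */
final class PastedTextDecoder {
    /** Decodes raw bytes using the named charset, e.g. "GB2312". */
    static String decode(byte[] raw, String charsetName) {
        // Not every runtime bundles the extended charsets, so check first.
        if (!Charset.isSupported(charsetName)) {
            throw new IllegalArgumentException("Charset not available: " + charsetName);
        }
        return new String(raw, Charset.forName(charsetName));
    }
}
```

The decoded String can then be handed to the text field, which renders it with whatever CJK-capable font is installed.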

Startup

The startup time (bringing up the main UI window pictured above) should not depend on the size of any single dictionary, nor should it depend on the completion of any network operation. Any initialization involving dictionary data or network-accessible data should start immediately, so as to take advantage of user idle time. However, these network or data operations should occur in the background. Once the user requests information from one of these sources, results for the UI operation in question should come back as soon as data is available, and progress should be displayed to the user.
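One way this startup behavior might look in Java is sketched below. The BackgroundSource class is hypothetical, and the loader stands in for whatever reads a dictionary file or fetches network data. Progress reporting is not modeled beyond a simple "ready" flag.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical wrapper: starts loading a data source without blocking startup. */
final class BackgroundSource<T> {
    private final CompletableFuture<T> loaded;

    BackgroundSource(Supplier<T> loader) {
        // supplyAsync runs the loader on another thread, so the constructor
        // returns immediately and the main window can come up right away.
        this.loaded = CompletableFuture.supplyAsync(loader);
    }

    /** True once loading has finished (successfully or not). */
    boolean isReady() {
        return loaded.isDone();
    }

    /** Blocks only when the user actually asks for data from this source. */
    T get() {
        return loaded.join();
    }
}
```

The UI would create one BackgroundSource per dictionary at startup and only call get() once the user issues a query that needs that source.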

Data sources should be located with a minimum amount of assistance from the user. If a canonical URL exists for a particular data source that must be mirrored to the user's computer, the URL should be known ahead of time by the program.

Supported Data Sources

Wo3 Hen3 Chan2 should be able to take advantage of information from the following data sources:

Encoding Detection

There is some discussion of this at http://sourceforge.net/forum/message.php?msg_id=34059

The pulldown in the upper right represents the "encoding" of the input. The pulldown has two modes, autodetect mode and manual mode.

In autodetect mode, the value displayed in the pulldown is based on the input in the text field, and perhaps some sort of user history, if we wanted to get fancy. In this mode, the pulldown displays the input type name followed by the string "[auto]".

For example:
If the user typed in 'x', the pulldown would display pinyin[auto] (since 'x' is more likely to be the start of a pinyin syllable than of an English word). If the user then types 'y', the pulldown switches to displaying english[auto].

If the user clicks on the pulldown, the pulldown items would be:

pinyin
english
...
GB2312

Note that the currently displayed item (the one with [auto]) does not appear in the list. If the user selects one of the encodings, the pulldown switches into manual mode.

In manual mode, the displayed encoding is also one of the pulldown selections. The only way for the selection to change is if the user manually changes it using the pulldown widget. The selections are the same as in automatic mode, with one addition:

[autodetect]

Selecting this will detect the encoding based on the current value of the text field, and switch back to automatic mode.
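The two pulldown modes described above can be modeled as a small state machine. The sketch below is only an illustration: the EncodingPulldown class is hypothetical, and the detector is passed in as a function because the actual detection heuristic is not specified here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical model of the encoding pulldown's autodetect and manual modes. */
final class EncodingPulldown {
    static final String AUTODETECT_ITEM = "[autodetect]";

    private final List<String> encodings;
    private final Function<String, String> detector; // maps input text to an encoding name
    private boolean auto = true;
    private String current;

    EncodingPulldown(List<String> encodings, Function<String, String> detector) {
        this.encodings = new ArrayList<>(encodings);
        this.detector = detector;
        this.current = detector.apply("");
    }

    /** Called whenever the text field changes; only autodetect mode reacts. */
    void onTextChanged(String text) {
        if (auto) {
            current = detector.apply(text);
        }
    }

    /** Called when the user picks an item from the opened pulldown. */
    void onSelect(String item, String currentText) {
        if (AUTODETECT_ITEM.equals(item)) {
            auto = true;
            current = detector.apply(currentText);
        } else {
            auto = false; // any explicit choice switches to manual mode
            current = item;
        }
    }

    /** Label shown on the closed pulldown. */
    String displayed() {
        return auto ? current + "[auto]" : current;
    }

    /** Items listed when the pulldown is opened. */
    List<String> items() {
        List<String> items = new ArrayList<>(encodings);
        if (!auto) {
            items.add(AUTODETECT_ITEM); // only offered in manual mode
        }
        return items;
    }
}
```

Keeping the detector separate from the mode logic means the heuristic (and any user-history weighting) can change without touching the widget behavior.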

I'm picturing the following modes:

Name	Encoding
English	ISO8859-1
Yale Cantonese	ISO8859-1
Pinyin	ISO8859-1
GB Simplified Chinese	GB2312
HZ (different than rfc 1842?)	HZ-GB-2312
ISO2022 Simplified	ISO-2022-CN
ISO2022 Traditional	ISO-2022-CN
Big5 Traditional Chinese	BIG5
EUC Traditional Chinese	EUC-TW
Unicode (simplified)	UTF-8
Unicode (traditional)	UTF-8
Unicode (passthrough)	UTF-8

Are there other encodings that I want? Do the three Unicode encodings make sense? (Does Unicode have separate code ranges for traditional and simplified? Probably... but should check.) What about ISO-2022-CN and ISO-2022-CN-EXT? Is it better to define another attribute: simplified vs. traditional?

What about UTF-7 (RFC 1642, 2152)?

What about RFC 1815?

Might be a good idea to read RFC 1502

Searches

Searches should be done in parallel so as to ensure the lowest possible latency for searches involving the network. The operating system is presumed to be able to optimize any disk contention on local searches.
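A minimal sketch of such a parallel search in Java might look like the following. It assumes each data source can be represented as a function from a query string to a result string. ParallelSearch and that representation are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical helper: queries every data source at the same time. */
final class ParallelSearch {
    /** Runs each source's lookup concurrently; results come back in source order. */
    static List<String> searchAll(String query, List<Function<String, String>> sources) {
        // One thread per source, so a slow network lookup never delays a local one.
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, sources.size()));
        try {
            List<CompletableFuture<String>> pending = new ArrayList<>();
            for (Function<String, String> source : sources) {
                pending.add(CompletableFuture.supplyAsync(() -> source.apply(query), pool));
            }
            List<String> results = new ArrayList<>();
            for (CompletableFuture<String> future : pending) {
                results.add(future.join()); // waits for this source only
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

This version waits for all sources before returning. A real UI would more likely attach a callback to each future (for example with thenAccept) so that each result is shown as soon as it arrives.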

Character Map: Concept

The character map (which might look something like Microsoft's Character Map) provides an offline, faster-lookup version of various character charts available in the world (e.g. Unihan).

It has the following goal:

Test a Java font setup for CJK character display capability.
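A rough sketch of such a font test is shown below. It relies on java.awt.Font.canDisplay(int) and scans the basic CJK Unified Ideographs block (U+4E00 to U+9FFF). The CjkCoverage class is hypothetical, and checking only that one block is an assumption; a fuller test would also cover the extension blocks.

```java
import java.awt.Font;
import java.util.function.IntPredicate;

/** Hypothetical helper: measures how much of a code point range a font can display. */
final class CjkCoverage {
    /** Counts code points in [first, last] that the predicate reports as not displayable. */
    static int countMissing(IntPredicate canDisplay, int first, int last) {
        int missing = 0;
        for (int cp = first; cp <= last; cp++) {
            if (!canDisplay.test(cp)) {
                missing++;
            }
        }
        return missing;
    }

    /** Checks a font against the basic CJK Unified Ideographs block. */
    static int missingCjk(Font font) {
        return countMissing(cp -> font.canDisplay(cp), 0x4E00, 0x9FFF);
    }
}
```

Calling missingCjk(new Font("Dialog", Font.PLAIN, 12)) would then report how many ideographs the default dialog font cannot render on the current setup, with 0 meaning full coverage of that block.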
