Home | Shorter Path | About Me

Archive for 06/2004

Foresight
Monday, June 21, 2004 08:27 PM

37 years ago, on June 5th, 1967, Israel launched an attack against Egypt. The war quickly expanded to three fronts - Egypt in the south, Jordan in the east, and Syria in the north. The 1967 war was one of Israel's greatest military achievements - in only six days, it defeated its enemies on all three fronts and conquered the Sinai Peninsula, the Gaza Strip, the West Bank (including the Jordanian part of Jerusalem), and the Golan Heights. The war ended on June 10th. To this day, Israel occupies the West Bank and the Gaza Strip (the Occupied Territories).

In the years following the war, Israeli society was in a state of total euphoria. Some rejoiced at the "liberation" of Jewish land, while others basked in the glory of the military victory. It took years for any real objection to the occupation and the subjugation of the Palestinians to rise.

Very few people, however, realized the meaning and consequences of the occupation immediately. On September 22nd, 1967, the ad reproduced here was published in Haaretz, a prominent Israeli newspaper. The ad text reads as follows:

Our right to defend ourselves from destruction
does not give us the right to oppress others

Occupation leads to Foreign Government
Foreign Government leads to Resistance
Resistance leads to Oppression
Oppression leads to Terrorism and Counter-Terrorism

The victims of terror are usually innocent

Possession of the Occupied Territories will make us a nation of murderers and murdered

Let Us Get Out of the Occupied Territories Now

At the bottom are twelve signatures. The last signature in the center row is my father's. I stumbled across this a few months ago and was struck by the accuracy of these predictions, and the courage these people had to make them at that time.


TeamB blogs
Thursday, June 17, 2004 11:47 PM

TeamB blogs are now up. The server, graciously hosted by Borland, runs a custom version of .Text using InterBase as the database back end and a data provider written in Delphi for .NET, courtesy of Robert Love.


Refactoring tools
Sunday, June 06, 2004 04:01 AM

Don Box points to this leaked image of a VB.NET context menu, showing the new refactoring support in Whidbey. As Microsoft's Paul Vick explains:

One thing that is likely, however, is that VB.NET won't be using the term “refactoring” in the IDE to refer to these features. While that term is fairly common in some quarters, we believe there are lots of VB programmers who are likely to see a menu item called “Refactor” and just ignore it because they don't know what the term means.


And another one
Saturday, June 05, 2004 04:56 AM

This is spreading like the Plague: Borland's Corbin Dunn (who's pretty much responsible for any Delphi feature that heavily uses color) is now blogging. An Atom feed is available here, and I notice it contains the full articles.


Internationalization
Saturday, June 05, 2004 01:36 AM

Roy tries to post in Hebrew. He says, and I quote: "???? ?????. ?????? ?????? ????? ???."

No, there's nothing wrong with your browser, and there's nothing wrong with my post. Hebrew support is just very difficult to implement. In fact, support for any language other than English can be difficult to implement, but Hebrew, Arabic, and several other languages make things even harder.

If you're developing any kind of application aimed at international users ("international user" meaning "anyone who doesn't live in the US, uses a foreign language, or even types words like 'résumé'"), be it a Windows program or a web page, you need to know some things about internationalization. Some of these things are cultural (for example, images and icons may have different meanings - sometimes even offensive meanings - in other cultures), and I won't discuss those here. You may need to consult with experts about such things. The technical aspects, however, are within your control, so that's what we'll cover.

First, some terminology. The two basic concepts in international applications are "internationalization" and "localization". Internationalization (affectionately known as "i18n") means preparing your application to accept multiple locales (a locale is a collective name for the user's environment, including the language, date, time, and number formats, and other culture-specific settings). This includes things like displaying numbers in a way the user can read them, accepting input in a variety of formats, and allowing users to enter data in their own language, when appropriate.

Localization means actually translating your application to a specific locale. This includes things like translating every text displayed by your application (including menus, toolbars, error messages and so on), making sure visual elements such as images and icons match the user's culture, and re-organizing the visual layout of your application to match the display language.

In this post, I'll focus on internationalization, and - since this was inspired by an attempt to post to a blog - on text. I'll also discuss some of the specific issues related to Hebrew and other similar scripts.

International Text

The main issue with supporting international text is that text has meaning and context, and proper handling of text requires consideration of these attributes. When I say "meaning" and "context", I'm not referring to the thoughts and ideas that might be expressed by the text, but to the way a digitized text stream should be interpreted. For example, English text contains both lower and upper case characters. When displaying text, programs should take this attribute into account. Sorting and searching, however, often require treating lower and upper case characters as identical, or at least very similar.
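To make the sorting and searching point concrete, here's a minimal Python sketch. The function names are mine, made up for illustration, and this only covers case; real locale-aware sorting involves more than folding case.

```python
def sort_ignoring_case(words):
    """Sort words so that case differences do not affect the order."""
    return sorted(words, key=str.casefold)


def contains_ignoring_case(haystack, needle):
    """Search for needle in haystack, treating upper and lower case as equal."""
    return needle.casefold() in haystack.casefold()


words = ["banana", "Zebra", "apple"]

# Plain sorting compares character codes, so the capital "Z" sorts first.
by_code = sorted(words)

# Folding case first gives the order a reader would expect.
by_letter = sort_ignoring_case(words)
```

Displaying the words still uses the original strings; only the comparison ignores case.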

I won't go into the history of digitized text storage, or into the details of character sets, code pages, or even Unicode. You can find plenty of information elsewhere. For a quick primer about these things, take a look at this article by Joel Spolsky. Joel makes the correct point that for a computer to handle text properly, it needs to know things about the text. In modern systems, such as Windows or the Web, that information is called "encoding". With regard to text, encoding means the collection of information a computer needs to know about how text is stored, such as what binary code is used for each character and how many bits of storage are required for a character. How text strings should be sorted is a closely related matter, though strictly speaking it's governed by the locale's collation rules rather than by the encoding itself.
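Here's a small Python sketch of what "encoding" means in practice: the same four Hebrew letters stored under different encodings, and what happens when the encoding can't represent them or is guessed wrongly. The sample word is my own choice; the codec names are the ones Python ships with.

```python
# The Hebrew word "shalom": shin, lamed, vav, final mem.
text = "\u05e9\u05dc\u05d5\u05dd"

utf8 = text.encode("utf-8")        # two bytes per Hebrew letter
utf16 = text.encode("utf-16-le")   # also two bytes per letter here
cp1255 = text.encode("cp1255")     # Windows Hebrew code page: one byte per letter

# An encoding with no Hebrew letters cannot store the text at all.
# Asking Python to substitute instead of failing gives question marks.
lossy = text.encode("latin-1", errors="replace")   # b"????"

# The right bytes read with the wrong encoding give the wrong characters.
misread = cp1255.decode("latin-1")

# Nothing is lost in that case, as long as the mistake is undone exactly.
recovered = misread.encode("latin-1").decode("cp1255")
```

The question marks are gone for good, while the misread text can still be repaired if you know which two encodings were mixed up.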

Platforms and Applications

If you want your code to support international text, the bare minimum you have to do is make it aware of text encoding. The level of support for encoding depends on your application. Simple applications could probably get away with just supporting the user's locale, relying on the operating system to handle text using the default settings. Platforms, on the other hand, must take encoding into consideration or they, and any application that runs on them, will be useless to international users.

Fortunately, most platforms today already consider text encoding. Let's consider Roy's blog, and see why his test post failed. Roy's blog runs on .Text, an increasingly popular blog engine written in C# and running on ASP.NET, with Microsoft SQL Server as its database back end. ASP.NET is part of the .NET Framework and runs over IIS. The framework, IIS, and SQL Server run on Microsoft Windows. .Text provides a web interface (that is, a page displayed by a web browser) to enter new blog posts.

All the Microsoft server products mentioned in the previous paragraph are platforms, and Microsoft made sure they all fully support international text. They all store text internally using a Unicode encoding, and all of them can import and export text in a variety of other encodings. The problem is that for internationalization to work, every component in the chain must support it. If only one fails, everything fails. In this case, it was .Text.

If you look at the HTML typically generated by .Text, you may notice something conspicuously missing. The <HEAD> section doesn't contain any encoding information. Web page encoding is stated by including a tag such as the following one in the <HEAD> section of a page:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

This tag (taken from the page you're now reading) tells the browser the page contains HTML and uses the Unicode UTF-8 encoding. Since .Text apparently doesn't care about encoding, it omits the tag, and the browser tries to guess the best encoding for the page. In order to post a message in Hebrew, the following chain of events must take place:

  • The user types in Hebrew text in the browser.
  • The browser posts the text to IIS, passing along the encoding information it thinks is correct.
  • IIS asks Windows to convert the text to Unicode, based on the requested encoding. The success of this may depend on Hebrew being supported by the specific installation of Windows.
  • IIS passes the request to ASP.NET.
  • ASP.NET passes the request to .Text.
  • .Text passes the data to SQL Server, specifying whatever encoding it thinks is appropriate.
  • SQL Server stores the data as Unicode.
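The chain above can be modeled in a few lines of Python. This is a deliberately simplified sketch, not a description of how IIS or .Text actually work: the function names are hypothetical, and a real browser may handle unrepresentable characters differently from the plain "?" substitution used here. It does show one plausible way a Hebrew post ends up as a row of question marks.

```python
def browser_submit(text, page_encoding):
    """The browser encodes the form text using the encoding it believes the page uses."""
    # Characters the encoding cannot represent are lost at this point
    # (simplified here as "?" substitution).
    return text.encode(page_encoding, errors="replace")


def server_receive(raw_bytes, assumed_encoding):
    """The server converts the request bytes to Unicode using the encoding it assumes."""
    return raw_bytes.decode(assumed_encoding, errors="replace")


def post(text, page_encoding, assumed_encoding):
    """Run a post through the chain and return what would be stored as Unicode."""
    return server_receive(browser_submit(text, page_encoding), assumed_encoding)


hebrew = "\u05e9\u05dc\u05d5\u05dd"  # "shalom"

# Every link agrees on UTF-8: the text survives.
stored_ok = post(hebrew, "utf-8", "utf-8")

# The browser guessed a Western encoding: the letters are gone before
# the server ever sees them.
stored_lost = post(hebrew, "latin-1", "latin-1")

# Browser and server disagree: the text arrives as the wrong characters.
stored_garbled = post(hebrew, "cp1255", "latin-1")
```

Once the database stores the result as Unicode it faithfully preserves whatever it was given, question marks included.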

.Text is more than an application: it's a platform for running blogs. As such, it should probably support internationalization. By ignoring text encoding, it renders itself useless for international users. Fortunately, since every other link in the chain supports internationalization, this should be fairly easy to fix.
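As a rough illustration of how one might check for the missing declaration, here's a Python sketch that scans a page for a Content-Type meta tag like the one shown earlier. The class and function names are my own, and it only looks at the http-equiv form of the tag; a server can also declare the encoding in the HTTP response headers, which this doesn't inspect.

```python
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Remembers the charset declared by the first Content-Type <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        # The content value looks like "text/html; charset=UTF-8".
        for part in (attributes.get("content") or "").split(";"):
            name, _, value = part.strip().partition("=")
            if name.lower() == "charset" and value:
                self.charset = value.strip()


def declared_charset(page):
    """Return the charset a page declares in its markup, or None if it declares none."""
    finder = CharsetFinder()
    finder.feed(page)
    return finder.charset


with_tag = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=UTF-8"></head><body></body></html>'
)
without_tag = "<html><head><title>Blog</title></head><body></body></html>"
```

A page for which `declared_charset` returns `None` leaves the browser guessing, unless the encoding is declared some other way.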

Hebrew and Complex Scripts

Selecting the proper encoding for text, marking it as using that encoding, and handling it based on that information using standard system services are all the actions you need to take to properly support most languages. Certain languages, however, require a little more work. Microsoft refers to such languages as "complex scripts", and provides a special API for handling them. The API, called "Uniscribe", is supported in Windows 2000 and later. It is also included with Internet Explorer 5.0 and later. Hebrew, Arabic, and Thai are considered complex scripts.

You don't have to use the Uniscribe API to support complex scripts, but you do have to take certain things into consideration.

Alignment

Text in right-to-left scripts is normally aligned to the right, and users reading or writing such scripts expect applications to support right-alignment of text.

Reading order

Even more important than alignment is the reading order. Right-to-left languages, such as Hebrew and Arabic, are actually bidirectional scripts. While text is read from right to left, numbers, for example, are read from left to right. Certain characters (mostly punctuation marks) are considered language-neutral, and are placed according to the current reading order, while others (such as English characters) are language specific and control the reading order. For example, the string "abc." would appear as ".abc" in RTL reading order. Other characters may change when displayed in RTL reading order. For example, parentheses are mirrored. Bidirectional support requires handling both the logical order of a string (as entered using the keyboard) and the visual layout, which may be very different. It should also handle things like caret movement and hit testing.
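The character properties behind this behavior are part of the Unicode character database, and Python's standard `unicodedata` module exposes them. The sketch below just looks them up; it doesn't reorder text, which is the job of a full bidirectional layout engine. The helper name and sample string are mine.

```python
import unicodedata


def bidi_classes(text):
    """Return (character, bidirectional class) pairs for each character."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# A Latin letter, the Hebrew letter alef, a digit, a period, a parenthesis.
sample = "a\u05d01.("
classes = [bidi_class for _, bidi_class in bidi_classes(sample)]

# "L"  : strong left-to-right (controls the reading order)
# "R"  : strong right-to-left (controls the reading order)
# "EN" : European number (always read left to right)
# "CS" : common separator (takes its direction from its surroundings)
# "ON" : other neutral (takes its direction from its surroundings)

# Parentheses are flagged as mirrored; ordinary letters are not.
paren_mirrors = unicodedata.mirrored("(") == 1
letter_mirrors = unicodedata.mirrored("a") == 1
```

The period in "abc." carries no direction of its own, which is why it moves to the other side when the reading order is RTL.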

For web pages, alignment and reading order can be controlled by the DIR attribute, which you can apply to almost any tag. Applying the attribute to the <HTML> or <BODY> tags controls the direction of the page (for example, <HTML DIR="RTL"> causes the entire page to be right-aligned, and even moves the vertical scroll bar to the left side of an Internet Explorer window). Applying the attribute to a <P> tag controls the direction of a single paragraph.
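If you generate pages from user-entered text, you have to decide which direction to declare. One common heuristic, sketched here in Python, is to use the direction of the first strongly directional character. The function names are hypothetical, and falling back to left-to-right when no strong character is found is my assumption. HTML attribute names and values like these are case-insensitive, so the lowercase `dir="rtl"` is equivalent to the uppercase form above.

```python
import html
import unicodedata


def base_direction(text):
    """Guess the paragraph direction from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):   # Hebrew and Arabic letters
            return "rtl"
        if bidi_class == "L":           # Latin and other left-to-right letters
            return "ltr"
    return "ltr"  # no strong characters at all: assume left-to-right


def paragraph(text):
    """Wrap text in a <p> tag with an explicit dir attribute."""
    return '<p dir="{}">{}</p>'.format(base_direction(text), html.escape(text))


english = paragraph("abc.")
hebrew = paragraph("\u05e9\u05dc\u05d5\u05dd")
# Digits and spaces are not strong, so the Hebrew letter decides.
mixed = base_direction("123 \u05d0")
```

The text itself is stored in logical order either way; the attribute only tells the browser how to lay it out.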

Special Considerations

Certain RTL languages, like Arabic, require contextual shaping, which means displaying a different glyph for a character depending on its position and the characters around it.
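Unicode happens to include legacy "presentation form" code points for the positional shapes of Arabic letters, which makes the idea easy to see from Python. This sketch only illustrates that one stored letter has several shapes. Rendering engines normally pick the glyph from the font at display time, and text is stored using the plain letter.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the character actually stored in text

# Legacy code points, one per positional shape of the same letter.
SHAPES = ["\ufe8f", "\ufe91", "\ufe92", "\ufe90"]


def describe_shapes():
    """Return (Unicode name, whether it folds back to BEH) for each shape."""
    return [
        (unicodedata.name(shape), unicodedata.normalize("NFKC", shape) == BEH)
        for shape in SHAPES
    ]


shapes = describe_shapes()
```

All four shapes normalize back to the same letter, so the choice between them is purely a display decision.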

Arabic script is heavily based on ligatures, in which a series of characters is joined into a single glyph. Arabic and Hebrew also support diacritics, which are placed above, below, or inside a character glyph.
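Both ideas show up in the Unicode character data, so here's a short Python sketch. The sample word and helper name are mine. It separates Hebrew letters from the vowel points attached to them, and expands a legacy code point for the Arabic lam-alef ligature into the two letters it represents.

```python
import unicodedata

# Hebrew "shalom" written with vowel points: four letters carrying three marks.
POINTED = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"


def split_marks(text):
    """Separate base characters from the combining marks attached to them."""
    bases = [ch for ch in text if unicodedata.combining(ch) == 0]
    marks = [ch for ch in text if unicodedata.combining(ch) != 0]
    return bases, marks


bases, marks = split_marks(POINTED)

# A legacy code point for the lam-alef ligature: one glyph, two letters.
LAM_ALEF_LIGATURE = "\ufefb"
letters = unicodedata.normalize("NFKC", LAM_ALEF_LIGATURE)
```

Seven stored characters take up only four positions on screen, which is one reason character counts, caret positions, and string widths can't be assumed to match.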

Thai, which is written left to right but is still considered a complex script, has complex rules for word breaks and text justification. A text entry control must be aware of these rules. Thai support must also prevent certain character combinations.

Conclusion

If you're developing an application and expect it to reach the hands of international users, you must take certain things into account. Ignoring text encoding may even render your application useless for such users. In today's world, where global distribution has become the default, you may not be able to afford that.

וזה לא כל כך קשה לכתוב בעברית, אם יש לך כלים מתאימים.

(Translation: "And it's not that hard to write in Hebrew, if you have the right tools.")


New Borland blog
Thursday, June 03, 2004 08:31 PM

Chris Bensen is blogging. An Atom feed can be found here, even though the link doesn't (yet) appear on the page.


Copyright 2004 Yorai Aminov