This is an archived forum and the content is probably no longer relevant, but is provided here for posterity.

Unicode Sorting Issues

August 02, 2009 5:44am

  • #1 / Aug 02, 2009 5:44am

    Kim Ryu Hyun

    65 posts

    This question may be related to a resolved thread.


    The alphabetical sorting order of all my weblogs and categories seems to be wrong. Here’s a screen capture of one of my category groups in EE’s alphabetical sorting order:

    http://screencast.com/t/8VpcVB1uD

    It may be hard for non-Korean readers to see the difference, but as you can see, all of the Hangul characters are placed between A and B. And the Hangul characters themselves are not in the proper order. If you would like to see the proper order for Hangul characters, please refer to the following Wikipedia article:

    http://en.wikipedia.org/wiki/Hangul

    I know it’s a long article, but if you search for “Consonantal jamo names” within the page, you can see the characters and their Roman names. Below that, under “Unicode Chart” of Hangul, you can see that Unicode itself has everything in the proper order. Alphabetical sorting works fine for all of my English weblogs and categories. Only the Korean weblogs and categories have problems. I have mixed Korean and English articles in my database.
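
As a small illustration of the order described above (this is not ExpressionEngine code), sorting by Unicode code point already puts precomposed Hangul syllables in dictionary order, after the Latin letters. The titles are sample values only, and a database collation may rank Latin and Hangul text differently from a plain code-point sort.

```python
# Sample titles (illustrative values only).
titles = ["박건자", "Friend", "강건일", "EllisLab"]

# Python compares strings code point by code point, so sorted() gives
# the order one would expect when the text is handled as real Unicode.
expected = sorted(titles)
print(expected)
```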

    Did you encounter similar issues with multi-byte languages in the past? Here are my system settings for now:


    MySQL character set : UTF-8 Unicode (utf8) and MySQL connection collation:  utf8_general_ci

    ExpressionEngine 1.6.8
    Build:  20090723

  • #2 / Aug 02, 2009 4:25pm

    Greg Aker

    6022 posts

    Greetings!

    We are going to check with the team on this one, and someone will be back with you.

    Thanks!

    -greg

  • #3 / Aug 04, 2009 4:35am

    Kim Ryu Hyun

    65 posts

    Even though I am storing all my data in Unicode and the database itself is set to use it, EE seems to read from and write to the database in Latin1 natively. I think that may be the cause of this problem.
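
A rough sketch of why a Latin1 connection could scramble the sort, under the assumption that the UTF-8 bytes of each title end up being treated as one Latin1 character per byte. The helper names are made up for illustration, Python's `latin-1` codec stands in for MySQL's latin1 (which is closer to Windows-1252), and the accent stripping only approximates what an accent-insensitive collation does.

```python
import unicodedata

def misread_as_latin1(text: str) -> str:
    """Encode as UTF-8, then decode the same bytes as Latin-1 (one char per byte)."""
    return text.encode("utf-8").decode("latin-1")

def base_letter(ch: str) -> str:
    """Drop accents from one character and uppercase it (rough collation stand-in)."""
    return unicodedata.normalize("NFD", ch)[0].upper()

for title in ["강건일", "박건자"]:
    garbled = misread_as_latin1(title)
    # Each 3-byte Hangul syllable becomes three unrelated Latin-1 characters.
    print(repr(garbled), "-> first character compares like", base_letter(garbled[0]))
```

If this assumption holds, both Hangul titles start with an accented "e" once misread, which would explain them landing among the Latin-letter titles instead of after them.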

  • #4 / Aug 04, 2009 6:28am

    Brendon Carr

    135 posts

    I think that’s the cause of the problem too. My own use of EE for Korean-language text is frustrated by this—I need to use a number of custom fields to get “alphabetic” sorts on Korean names and that’s not an elegant or intuitive solution at all.

    Are you in Seoul? That would be three forum members that I know of.

  • #5 / Aug 04, 2009 8:57am

    Kim Ryu Hyun

    65 posts

    Hi Brendon,

    I am glad to see you here. And yes, I am in Seoul. I am relatively new to EE and am the Korean translator for the EE lang pack. This seems to be a major problem affecting not just Korean but the rest of the non-Latin languages that use multi-byte characters, such as Japanese, Chinese, Hebrew and Russian.

    I am surprised to find that no one has really tackled this issue until now. It seems to be a make-or-break issue for EE if they are serious about world markets other than the Latin-based ones. I have found a Korean EE user who has hacked EE to force reading and writing in Unicode. I am not sure it solves this particular problem, but I have to check.

    In all of EE web pages, hacking is not recommended and I do not wish to go that route until I am forced to do so as a last resort. If it comes to that, I really need to rethink about continuing to use EE for any of my projects.

    In the meantime, I would like to know what you have done to solve this issue, even if it’s not the most elegant solution.

    Thanks.

  • #6 / Aug 04, 2009 10:22am

    Sue Crocker

    26054 posts

    Just a quick update, this issue has been escalated to the Dev team. Thanks in advance for your patience!

  • #7 / Aug 04, 2009 8:41pm

    Robin Sowell

    13255 posts

    Sorry this one sat for a bit.  I was reading the history- and am slightly concerned that the import may be playing a part.  Though multi-byte can prove tricky all by itself.

    To focus in on what’s up- I’d like to try a really simple replication.  Can you give me say- three titles in the order they SHOULD be sorting- that do NOT sort in that order on your site?  Something I can copy/paste to try and replicate (if necessary, put it in a text file, zip and attach).  I’ll start by seeing what happens on a UTF-8 site on EH to try and narrow in on the problem.

  • #8 / Aug 04, 2009 11:44pm

    Kim Ryu Hyun

    65 posts

    Here’s the titles with correct sorting order:

    EllisLab(1)
    Friend(2)
    강건일(3)
    박건자(4)

    Depending on the back-end database collation, English words can come before or after Hangul words. They come out all mixed-up on my system which has database collation set to utf8_general_ci. Here’s how they are sorted in my category group edit page:

    EllisLab(1)
    박건자(4)
    강건일(3)
    Friend(2)

    Another system, with database collation set to latin1_swedish_ci, comes out a little differently but still mixed, as follows:

    EllisLab(1)
    강건일(3)
    박건자(4)
    Friend(2)

    Thanks.

  • #9 / Aug 06, 2009 11:44am

    Robin Sowell

    13255 posts

    Ouch- these hurt my brain.  I duplicate wonkiness- utf8.  On the plus side- my 2.0 install works correctly.  Which is a giant hint.  Now to figure out why….

  • #10 / Aug 06, 2009 12:35pm

    Robin Sowell

    13255 posts

    OK- after a fair bit of comparison… I suspect the client collation may be the culprit.

    IF everything is definitely utf-8?  All db settings and all data (you don’t have old/imported data that’s not in utf-8)- check this kblog entry- and try the hack regarding

    mysql_query("SET NAMES 'utf8'");

    While hacks are not encouraged, this issue will be fully addressed in 2.0 (and a test shows no problems with it in the beta).  With the 1.x branch, we can’t approach things in quite the same way due to the variety of possible db collation settings being used.

  • #11 / Aug 06, 2009 2:54pm

    Kim Ryu Hyun

    65 posts

    It’s not clear what you want me to do here. Are you suggesting that I convert all my latin1 encoded utf8 database to utf8 everything before trying the hack? Otherwise, this wouldn’t work.

    I am pretty sure that all my data is in utf8 but since EE reads and writes in latin1 natively, my database is currently encoded in latin1 even though I have set it to be utf8.
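
The "latin1 encoded utf8" situation described here can be sketched in a few lines, purely to show the principle. This is an assumption-laden illustration, not a migration script: `repair_double_encoded` is a hypothetical helper, Python's `latin-1` codec is only a stand-in for MySQL's latin1, and an actual database conversion would be done with MySQL's own tools on a backed-up copy.

```python
def repair_double_encoded(garbled: str) -> str:
    """Undo 'UTF-8 bytes read as Latin-1': recover the bytes, then decode as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")

# Simulate a value that was written as UTF-8 but is treated as Latin-1.
stored = "강건일".encode("utf-8").decode("latin-1")

print(repr(stored))                   # unreadable outside a client that repeats the same mistake
print(repair_double_encoded(stored))  # the original title again
```

The round trip also suggests why the text looks fine inside the EE control panel but garbled in other tools: reading with the same wrong assumption used for writing cancels the error out.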

    This is why I asked, in my earlier thread related to this one, about not being able to read Korean correctly with any tool except the EE CP. EE must not have a significant user base outside of the Latin world, or they must all be using hacks like this one.

    Even if I convert my database and apply the hack as you suggest, am I going to have a smooth upgrade path to 2.0? If so, please explain how.

  • #12 / Aug 06, 2009 5:49pm

    Robin Sowell

    13255 posts

    Blargh- I thought utf all around and it would work out.  Unfortunately- while the hack makes it work- it only does so after I go through and resave each entry.  Which was no big deal since I only had entries.  However, with 2.0 looming, I wouldn’t want to do a full conversion right now.

    If you have any choice in the matter?  I would leave this for now.  While we could work around it, given the proximity of 2.0 I really don’t know that I’d want to manually convert the database over.

    I know the sort issue is an inconvenience, but it is a livable one at the moment?  It’s the best option if it’s possible.

  • #13 / Aug 06, 2009 11:36pm

    Kim Ryu Hyun

    65 posts

    Well Robin,

    I have got thousands of entries with complex relationships. It’s not just inconvenience but I simply cannot go to production this way.

    Is there any way I could get hold of 2.0 beta? I have deadline to meet and am in deep trouble if I can’t.

  • #14 / Aug 07, 2009 1:21pm

    Robin Sowell

    13255 posts

    The second phase of the 2.0 beta hasn’t started yet- as soon as it does, I’d suggest applying for it and referencing the unicode issues as one element in support of why you’d make a good beta tester.

    That said- it’s still in beta and even at that point, I would not put an important live site on it.  So time wise, I don’t think that will fit with your deadline.

    An alternative that shouldn’t endanger the deadline would be to use a custom field to specifically set the sort order- order on that rather than title.  There are some third party offerings that make the interface for that simple as well- see Reeorder.
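
For a large number of entries, the values for such a sort-order field could in principle be computed rather than typed by hand. The sketch below is only an illustration under that assumption: `sort_order_values` is a hypothetical helper, the titles are sample data, and how the values would be written into an ExpressionEngine custom field is not shown here.

```python
def sort_order_values(titles):
    """Map each title to a zero-padded rank based on code-point order."""
    width = len(str(len(titles)))  # pad so the ranks also sort correctly as text
    ranked = enumerate(sorted(titles), start=1)
    return {title: str(rank).zfill(width) for rank, title in ranked}

orders = sort_order_values(["박건자", "Friend", "강건일", "EllisLab"])
for title, order in sorted(orders.items(), key=lambda item: item[1]):
    print(order, title)
```

Zero-padding matters if the field is compared as text, since "10" would otherwise sort before "2".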

    My apologies on this one.  But at the moment?  The custom order is really the method I’d recommend given all of the circumstances.

  • #15 / Aug 07, 2009 10:30pm

    Kim Ryu Hyun

    65 posts

    This is not an acceptable solution at all. I have got thousands of entries and I have to go through each entry and do a custom order by hand. Let me be clear here.

    You acknowledge that you have a unicode sorting bug in EE 1.x software. This probably affects all of your non-Latin customers with any significant database.

    As far as I know from my previous conversations with EE, 1.x software will continue to be sold even after 2.0 release. And EllisLab’s official position is NOT to fix this bug or do ANYTHING about it?
