This is an archived forum and the content is probably no longer relevant, but is provided here for posterity.

High ASCII characters

September 18, 2009 12:23pm

#1 / Sep 18, 2009 12:23pm

Deron Sizemore
1033 posts

It was brought to my attention today that we’ve got some weird characters showing up. For example: http://ksba.org/board-team-development/student-achievement

All double and single quotes (maybe other characters as well) seem to be displaying weird symbols on the site.

Can anyone point me in the right direction as to how to fix this and maybe give me a clue as to what I’m doing wrong so it doesn’t happen again? I know the user who updates the site copies and pastes from Word a lot. According to him, the characters usually get converted on the live site, but the edit entry area in the control panel shows the weird characters.

I’ve encountered this before on another site and just went through every single entry and changed them. That’s not really an option on this site since it’s much larger.
#2 / Sep 18, 2009 12:39pm
Ingmar
29245 posts
The symptoms are pretty clear, you have UTF-8 encoded characters displayed as Latin-1. Even though you have
```
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
```
in your code, your server sends HTML pages as iso-8859-1.
```
HTTP/1.x 200 OK
Date: Fri, 18 Sep 2009 15:35:13 GMT
Server: Apache/2.2.3 (Debian) PHP/5.2.0-8+etch13 mod_ssl/2.2.3 OpenSSL/0.9.8c
Transfer-Encoding: chunked
Content-Type: text/html; charset=iso-8859-1
```
This is, in all probability, a server configuration issue. If on Apache, check for AddDefaultCharset: when it is set to On, Apache sends iso-8859-1.
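The mismatch described above can be reproduced in a few lines of Python. This is only a sketch: the sample word is taken from the first post, and it assumes the browser handles the `iso-8859-1` label as windows-1252 (`cp1252` in Python), which is what current browsers do.

```python
# What the author typed: a typographic apostrophe (U+2019), as pasted from Word.
original = "we’ve"

# The page is stored and sent as UTF-8, so the apostrophe becomes three bytes.
utf8_bytes = original.encode("utf-8")

# Assumption: the browser maps the "iso-8859-1" label to windows-1252,
# so each of those three bytes is shown as a separate character.
misread = utf8_bytes.decode("cp1252")

# Decoding with the encoding the bytes were written in restores the text.
correct = utf8_bytes.decode("utf-8")

print(utf8_bytes)  # b'we\xe2\x80\x99ve'
print(misread)     # weâ€™ve
print(correct)     # we’ve
```

The bytes never change; only the label used to interpret them does.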
#3 / Sep 18, 2009 12:52pm

Deron Sizemore
1033 posts

Ingmar,

I’m glad it’s clear to someone at least… it’s about as clear as mud to me. 😊

So, to break this down I have UTF-8 encoded characters displaying as Latin-1 on the site. The Latin-1 characters are the weird ones, correct?

If I have the charset set to UTF-8, how then is the server still sending HTML pages as iso-8859-1?

Sorry, I need to do more research on all of this. I’ve never really understood all of the encoding types and what they do and why I use one over others, etc.
#4 / Sep 18, 2009 1:01pm
Ingmar
29245 posts
So, to break this down I have UTF-8 encoded characters displaying as Latin-1 on the site. The Latin-1 characters are the weird ones, correct?

Yes. For plain 7-bit ASCII characters (English letters, numbers and a few other symbols) encoding does not matter at all. For high ASCII characters, however, like typographical quotes, umlauts etc., it does. There is more than one way to encode them. In western languages we usually use iso-8859-1 (aka Latin-1) or utf-8 (an encoding of Unicode). The latter is preferable, and clearly the more future-compatible way. Your characters are encoded as utf-8, so that’s fine.

Now we also need to tell the browser that it is, in fact, dealing with such characters. This is usually done via HTTP headers. Since we seldom set them directly, the
```
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
```
line is generally used, an “HTTP Header Equivalent”. So far, you’re doing everything correctly and by the book. The problem, now, is that your server forces a “real” HTTP header on all connections. Every HTML page it serves goes out with an “iso-8859-1” header, regardless of the actual code.
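The precedence described here, where a charset in the real HTTP header beats the meta tag, can be sketched in Python. The function names `header_charset` and `effective_charset` are made up for this illustration, and the model is deliberately simplified: real browsers also consider a byte order mark and other signals.

```python
from email.message import Message
from typing import Optional


def header_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None if absent.
    return msg.get_content_charset()


def effective_charset(http_content_type: str, meta_charset: Optional[str]) -> Optional[str]:
    """Simplified model: the HTTP header wins; the meta tag is only a fallback."""
    from_header = header_charset(http_content_type)
    if from_header is not None:
        return from_header
    return meta_charset.lower() if meta_charset else None


# The server in this thread: the header overrides the UTF-8 meta tag.
print(effective_charset("text/html; charset=iso-8859-1", "UTF-8"))  # iso-8859-1

# A server that sends no charset: the meta tag gets to decide.
print(effective_charset("text/html", "UTF-8"))  # utf-8
```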

If I have the charset set to UTF-8, how then is the server still sending HTML pages as iso-8859-1?

It’s a server-side override, if you will (or disregard for your stated preference, if you prefer). For a quick fix, try adding
```
AddDefaultCharset Off
```
to your .htaccess. If that doesn’t work, please speak to your host about the issue.

I’ve never really understood all of the encoding types and what they do and why I use one over others, etc.

It’s so much easier these days, just use utf-8 for everything.
#5 / Sep 18, 2009 1:27pm
Deron Sizemore
1033 posts
Thanks for the detailed explanation Ingmar. Makes a whole lot more sense to me now. I used to think all of the issues I’ve seen in the past where I get the weird characters stemmed from the way information was encoded in the database. Maybe that’s where I was getting confused.

The only bit I’m still fuzzy on is that you said for High ASCII characters there’s more than one way to code them; either iso-8859-1 or utf-8. I’m trying to encode them utf-8 but the header information is sending iso-8859-1 and overriding it. So, if iso-8859-1 is actually one of the ways we can encode High ASCII characters, why doesn’t it work on the site? Or is it simply because it’s confused; I’m trying to encode utf-8 but iso-8859-1 is getting processed thus the weird characters. If I had just used
```
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
```
instead of utf-8 from the start, would everything be working fine since it would match up with the server? Or would it still be broken?

Anyway, I tried adding AddDefaultCharset Off to the .htaccess file and it did not fix the characters, so I will get in contact with my host.

Here is what my .htaccess file looked like after I made the change:

RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ /index.php/$1 [L]

AddDefaultCharset Off

Thanks
#6 / Sep 18, 2009 4:18pm
Ingmar
29245 posts
The only bit I’m still fuzzy on is that you said for High ASCII characters there’s more than one way to code them; either iso-8859-1 or utf-8.

There are other code pages, too, but these two are the most common ones for the Western European language group (iso-8859-15 if you need the € symbol. Which you probably won’t. I digress.)

The problem is always that, while the basic 7-bit characters remain the same, the additional 128 characters that the 8th bit gives us are not the same in all languages. Which makes sense, if you think about it, because Russian obviously needs a different set of 128 characters than Norwegian, or French, or Hungarian. So we need to tell the browser how to interpret these “high” characters, which character set to choose from. (Unicode is an exception here in that it can display pretty much all characters from all known languages, and is clearly the superior choice.)
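A small Python sketch of that point: the same byte value decodes to a different character depending on which 8-bit code page is assumed, while 7-bit ASCII stays the same everywhere. The byte values are picked purely for illustration.

```python
# One "high" byte (8th bit set), interpreted under two different code pages.
high_byte = b"\xe9"
as_latin1 = high_byte.decode("iso-8859-1")    # Western European
as_cyrillic = high_byte.decode("iso-8859-5")  # Cyrillic

# The currency byte that separates iso-8859-1 from iso-8859-15.
currency_latin1 = b"\xa4".decode("iso-8859-1")
currency_latin9 = b"\xa4".decode("iso-8859-15")

# A plain 7-bit ASCII byte means the same thing in all of them.
ascii_everywhere = {b"A".decode(c) for c in ("iso-8859-1", "iso-8859-5", "utf-8")}

print(as_latin1, as_cyrillic)            # é щ
print(currency_latin1, currency_latin9)  # ¤ €
print(ascii_everywhere)                  # {'A'}
```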

I’m trying to encode them utf-8 but the header information is sending iso-8859-1 and overriding it.

Yes. Let me give you a crude example: Imagine a library, if you will, where you pick up a book from the section labeled “French literature”. The book is printed in French all right, but when the librarian hands it to you he says “actually, this one’s in English, sorry about that”. The librarian in this example is your server. If he hadn’t said anything, you would have treated it as French, and been perfectly happy with it. Now that you think it’s English (and have forgotten all your French) you won’t be able to read it, although the book hasn’t changed at all.

So, if iso-8859-1 is actually one of the ways we can encode High ASCII characters, why doesn’t it work on the site?

It would, but you’d actually have to encode your documents as iso-8859-1, not utf-8 as you do. In our example, you’d have to translate the book into English, and put it in the correct section. Then, when you check it out the librarian again tells you it’s in English, only you’re OK with it this time, because it’s what you’ve been expecting.

So, yes, if your server forces Latin-1 on all documents you can solve the issue by actually storing your documents in that format. I don’t recommend it, though, utf-8 is clearly the way to go. Most modern software uses it by default.

I’m trying to encode utf-8 but iso-8859-1 is getting processed thus the weird characters. If I had just used
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
instead of utf-8 from the start, would everything be working fine since it would match up with the server?

Well, the “Content-Type” is only the declaration, or classification if you will. If you put a French book into the “English literature” section, it’s still in French. You’d also have to translate it (= convert characters from utf-8 to iso-8859-1). If you tell EE that iso-8859-1 is what you want to use from now on (= change the language in which your books are being written) future documents and templates will be correct, but existing ones still need to be converted.
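The difference between relabeling and converting can be shown in Python. This is a sketch with sample strings chosen for illustration. Note one extra wrinkle: typographic quotes have no slot in iso-8859-1 at all, so a real conversion has to fall back to something like numeric HTML entities for them.

```python
text = "café"
utf8_bytes = text.encode("utf-8")  # b'caf\xc3\xa9'

# Relabeling only: the same bytes, now read as iso-8859-1. Still broken.
relabeled = utf8_bytes.decode("iso-8859-1")

# Converting: decode with the real encoding, then re-encode as iso-8859-1.
latin1_bytes = utf8_bytes.decode("utf-8").encode("iso-8859-1")
converted = latin1_bytes.decode("iso-8859-1")

# A typographic apostrophe cannot be stored in iso-8859-1, so it is
# replaced here by a numeric character reference instead.
apostrophe = "’".encode("iso-8859-1", errors="xmlcharrefreplace")

print(relabeled)     # cafÃ©
print(latin1_bytes)  # b'caf\xe9'
print(converted)     # café
print(apostrophe)    # b'&#8217;'
```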

Still with me? It’s not rocket science, but still moderately complex. Took me some time to fully grok it, and then only out of necessity 😉
#7 / Sep 24, 2009 1:41pm

Deron Sizemore
1033 posts

Ingmar,

Thanks for taking time to explain. I’m following what’s happening a lot better now.

I contacted our host and they changed the server setting to utf-8 (supposedly). But, when I visit one of the pages that had the character issues, it’s not fixed. Will existing pages and templates need to be fixed manually or should they be fixed as soon as the encoding is correct on the server?
#8 / Sep 24, 2009 5:32pm

Ingmar
29245 posts

I contacted our host and they changed the server setting to utf-8 (supposedly).

They really shouldn’t force any encoding; it’s usually set via headers in the template itself. Gives you much more flexibility as a developer.

That said, can you give me a link to a page that should have been fixed? Using http://ksba.org/board-team-development/student-achievement as an example, it’s still being served as iso-8859-1.
#9 / Sep 25, 2009 9:51am

Deron Sizemore
1033 posts

I contacted our host and they changed the server setting to utf-8 (supposedly).

They really shouldn’t force any encoding; it’s usually set via headers in the template itself. Gives you much more flexibility as a developer.

That said, can you give me a link to a page that should have been fixed? Using http://ksba.org/board-team-development/student-achievement as an example, it’s still being served as iso-8859-1.

I was under the impression that they should change the setting on the server to send HTML pages as UTF-8 instead of iso-8859-1? Wouldn’t that be forcing the encoding?

Every single page on ksba.org should have been (by my understanding) changed to use UTF-8 encoding instead of iso-8859-1. The “student achievement” page is one that isn’t fixed and still has the weird characters. I’m going to contact our host again.

EDIT: I see what you’re saying. I checked the HTTP headers for logogala.com (on EngineHosting) and see that they are not “forcing” any encoding. It just shows “text/html” for the content-type, so at that point I can specify it in my templates.
#10 / Sep 25, 2009 10:38am
Ingmar
29245 posts
I was under the impression that they should change the setting on the server to send HTML pages as UTF-8 instead of iso-8859-1? Wouldn’t that be forcing the encoding?

Well, yes, because it would mean that you couldn’t use any other encoding. With utf-8 that’s not so bad because that’s what you’ll want to use anyway, but there’s really no need to do so indiscriminately, and for all served documents. I have a hard time seeing where and how AddDefaultCharset would be necessary when you can simply use
```
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
```
in your code.

Every single page on ksba.org should have been (by my understanding) changed to use UTF-8 encoding instead of iso-8859-1.

Sorry, no. This is the header I get:
```
HTTP/1.x 200 OK
Date: Fri, 25 Sep 2009 13:36:05 GMT
Server: Apache/2.2.3 (Debian) PHP/5.2.0-8+etch13 mod_ssl/2.2.3 OpenSSL/0.9.8c
X-Powered-By: PHP/5.2.0-8+etch13
Content-Type: text/html; charset=iso-8859-1
```
Note the “charset” in the last line.
#11 / Sep 25, 2009 11:02am

Deron Sizemore
1033 posts

Ingmar,

Talked with our hosting provider again a few minutes ago. While they “thought” they had changed it, I showed them that it still wasn’t changed; they thought they knew what the problem was and are now looking into it. He said that he changed the “AddDefaultCharset” to UTF-8 and I checked the headers again and it still showed iso-8859-1. Something else was overriding the setting.

I’ll keep you updated.
#12 / Sep 25, 2009 11:05am
Ingmar
29245 posts
Talked with our hosting provider again a few minutes ago. While they “thought” they had changed it, I showed them that it still wasn’t changed; they thought they knew what the problem was and are now looking into it.

Good to hear it.

He said that he changed the “AddDefaultCharset” to UTF-8 and I checked the headers again and it still showed iso-8859-1. Something else was overriding the setting.

Definitely looks like it. Can they not simply try
```
AddDefaultCharset Off
```
?

I’ll keep you updated.

Very good, we’ll be here.
#13 / Sep 25, 2009 11:13am
Deron Sizemore
1033 posts
He said that he changed the “AddDefaultCharset” to UTF-8 and I checked the headers again and it still showed iso-8859-1. Something else was overriding the setting.

Definitely looks like it. Can they not simply try
AddDefaultCharset Off
?

Well, that was one thing I mentioned to them: you had said it’s best not to force any encoding and that it should just be off so that I can specify it in the templates. Not sure what they will end up doing.

It doesn’t help matters that I’ve never configured an apache server before and am basically going into this blind and just relaying the info to them from this thread. Hopefully they understand though. 😊
#14 / Sep 25, 2009 11:17am

Ingmar
29245 posts

I think it’s quite a common issue, to be honest. That said AddDefaultCharset is not a directive that’s loaded by default, so somebody must have considered it necessary at some point.

Thanks for keeping us in the loop 😊
#15 / Sep 25, 2009 2:34pm

Deron Sizemore
1033 posts

Ingmar,

The host finally was able to change the encoding on the server. They said that they changed it to be blank so that it would use whatever was specified in the templates but when I check the header info, it’s now showing UTF-8 instead of iso-8859-1. If they left the setting blank or turned it off, wouldn’t the header check display nothing?

He said too that the setting he changed was in the PHP ini file on the server?

If you check the page in question, it did fix the weird characters, but not fully. It just turned them all into two question marks, one of them with a diamond around it.

Is this another issue that needs to be resolved by our host or something that I can take care of? I’ve got the setting set to yes for “Automatically Convert High ASCII Text to Entities” but I don’t know if that even matters in this case.

Thanks
