Group: comp.lang.python


Subject: Help needed with python unicode cgi-bin script
From: weheh
Date: 12/10/2007 7:29:39 AM
Dear web gods:

After much, much, much struggle with unicode, many an hour reading all the examples online, coding them, testing them, ripping them apart and putting them back together, I am humbled. Therefore, I humble myself before you to seek guidance on a simple python unicode cgi-bin scripting problem.

My problem is more complex than this, but how about I boil down one sticking point for starters. I have a file with a Spanish word in it, "año", which I wish to read with:

#!C:/Program Files/Python23/python.exe

STARTHTML= u'''Content-Type: text/html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
</head>
<body>
'''
ENDHTML = u'''
</body>
</html>
'''
print STARTHTML
print open('c:/test/spanish.txt','r').read()
print ENDHTML

Instead of seeing "año" I see "a?o". BAD BAD BAD
Yet, if I open the file with the browser (IE/Mozilla), I see "año." THIS IS WHAT I WANT

WHAT GIVES?

Next, I'll get into codecs and stuff, but how about starting with this?

The general question is, does anybody have a complete working example of a cgi-bin script that does the above properly that they'd be willing to share? I've tried various examples online but haven't been able to get any to work. I end up seeing hex code for the non-ascii characters u'a\xf1o', and later on 'a\xc3\xb1o', which are also BAD BAD BAD.

Thanks -- your humble supplicant.
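
The symptom described above (a "?" or a box where the ñ should be) is what typically appears when a lone byte 0xF1 is read as UTF-8. The following is a minimal sketch in modern Python 3, not the Python 2.3 used in the thread, and it assumes the file holds "año" saved as windows-1252. The variable names are illustrative only.

```python
# Assumption: the file contains "año" saved as windows-1252,
# i.e. one byte per character, with ñ stored as 0xF1.
raw = "a\u00f1o".encode("windows-1252")

# Decoding with the matching codec recovers the text.
as_cp1252 = raw.decode("windows-1252")

# 0xF1 followed by "o" is not a valid UTF-8 sequence. With
# errors="replace" the bad byte becomes U+FFFD, the replacement
# character that browsers usually draw as "?" or a box.
as_utf8 = raw.decode("utf-8", errors="replace")

print(ascii(raw))        # the bytes on disk
print(ascii(as_cp1252))  # the intended text, shown with escapes
print(ascii(as_utf8))    # what a UTF-8 reader ends up with
```

The point of the sketch is that the bytes themselves are fine; what matters is which encoding the reader is told to use.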

Subject: Help needed with python unicode cgi-bin script
From: Jack
Date: 12/9/2007 11:36:04 PM
You probably need to set stdout to binary mode. It is not in binary mode by default on Windows.

Subject: Help needed with python unicode cgi-bin script
From: weheh
Date: 12/10/2007 10:53:57 PM
Thanks for the reply, Jack. I tried setting mode to binary but it had no effect.

"Jack" <nospam@invalid.com> wrote in message news:y_6dnZT4ccX5ccHanZ2dnUVZ_vKunZ2d@comcast.com...
> You probably need to set stdout mode to binary. They are not by default on
> Windows.

Subject: Help needed with python unicode cgi-bin script
From: weheh
Date: 12/10/2007 10:55:00 PM
Hi Martin, thanks for your response. My updates are interleaved with your response below:

> What is the encoding of that file? Without a correct answer to that
> question, you will not be able to achieve what you want.

I don't know for sure the encoding of the file. I'm assuming it has no intrinsic encoding since I copied the word "año" into vim and then saved it as the example text file called "spanish.txt".

> Possible answers are "iso-8859-1", "utf-8", "windows-1252", and "cp850"
> (these all support the word "año")
>
>> Instead of seeing "año" I see "a?o".
>
> I don't see anything here. Where do you see the question mark? Did you
> perhaps run the CGI script in a web server, and pointed your web browser
> to the web page, and saw the question mark in the web browser?

The cgi-bin script prints to stdout, i.e. to my browser, and when I use print I see a square box where the ñ should be. When I use print repr(...) I see 'a\xf1o'. I never see the desired 'ñ' character.

> Sending "Content-type: text/html" is not enough. The web browser needs
> to know what the encoding is. So you should send
>
> Content-type: text/html; charset="your-encoding-here"

Sorry, somehow my cut and paste job into outlook missed the exact line you had above that specifies encoding to be set as "utf8", but it's there in my program. Not to worry.

> Use "extras/page information" in Firefox to find out what the web
> browser thinks the encoding of the page is.

Firefox says the page is UTF8.

> P.S. Please, stop shouting.

OK, it's just that it hurts when I've been pulling my hair out for days on end over a single line of code. I don't want to go bald just yet.

Subject: Help needed with python unicode cgi-bin script
From: Jack
Date: 12/10/2007 10:17:38 PM
Just want to make sure, how exactly are you doing that?

> Thanks for the reply, Jack. I tried setting mode to binary but it had no
> effect.

Subject: Help needed with python unicode cgi-bin script
From: weheh
Date: 12/11/2007 4:21:40 PM
import sys
if sys.platform == "win32":
    import os, msvcrt
    msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)

"Jack" <nospam@invalid.com> wrote in message news:A72dne5UzKYftsPanZ2dnUVZ_uKpnZ2d@comcast.com...
> Just want to make sure, how exactly are you doing that?
>
>> Thanks for the reply, Jack. I tried setting mode to binary but it had no
>> affect.

Subject: Help needed with python unicode cgi-bin script
From: weheh
Date: 12/11/2007 5:46:12 PM
Hi John:

Thanks for responding.

>Look at your file using
>    print repr(open('c:/test/spanish.txt','rb').read())
>If you see 'a\xf1o' then use charset="windows-1252"

I did this ... no change ... still see 'a\xf1o'

>else if you see 'a\xc3\xb1o' then use charset="utf-8" else ????

>Based on your responses to Martin, it appears that your file is
>actually windows-1252 but you are telling browsers that it is utf-8.

>Another check: if the file is utf-8, then doing
>    open('c:/test/spanish.txt','rb').read().decode('utf8')
>should be OK; if it's not valid utf8, it will complain.

No. This causes decode error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-4: invalid data
      args = ('utf8', 'a\, 1, 5, 'invalid data')
      encoding = 'utf8'
      end = 5
      object = 'a\xf1o'
      reason = 'invalid data'
      start = 1

>Yet another check: open the file with Notepad. Do File/SaveAs, and
>look at the Encoding box -- ANSI or UTF-8?

Notepad says it's ANSI

Thanks. What now? Also, this is a general problem for me, whether I read from a file or read from an html text field, or read from an html text area. So I'm looking for a general solution. If it helps to debug by reading from textarea or text field, let me know.
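
The checks quoted above amount to a small decision procedure: look at the raw bytes, try UTF-8, and fall back to the Windows "ANSI" code page. A rough sketch in Python 3 follows. The function name `sniff_encoding` is hypothetical, and the windows-1252 fallback is an assumption that only holds for a Western European Windows setup like the one described in the thread.

```python
def sniff_encoding(data: bytes) -> str:
    """Guess "ascii", "utf-8", or "windows-1252" for a short byte string.

    Hypothetical helper. It cannot prove the encoding; it only rules
    out candidates that fail to decode.
    """
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8. Assumption: the file came from a Windows
        # editor using the Western European "ANSI" code page.
        return "windows-1252"


print(sniff_encoding(b"a\xf1o\r\n"))    # bytes reported in the thread
print(sniff_encoding(b"a\xc3\xb1o\r\n"))  # the same word stored as UTF-8
```

A guess like this is only a diagnostic aid. The reliable fix is to know how the file was saved and to declare that same encoding to the browser.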

Subject: Help needed with python unicode cgi-bin script
From: weheh
Date: 12/11/2007 10:51:38 PM
John & Martin,

Thanks for your help and kind words of encouragement. Still, what you have suggested doesn't seem to work, unless I'm not understanding your directive to encode as 'windows-1252'. Here's my program in full:

#!C:/Program Files/Python23/python.exe
import cgi, cgitb
import sys, codecs
import os,msvcrt
cgitb.enable()
print u"""Content-Type: text/html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
<html xmlns="http://www.w3.org/1999/xhtml" lang="en,sp,fr" xml:lang="en,sp,fr">
<head>
<meta http-equiv="content-type" content="text/html; charset=windows-1252" />
<meta http-equiv="content-language" content="en,fr,sp" />
</head>
<body>
"""
if sys.platform == 'win32':
    msvcrt.setmode(sys.stdin.fileno(), os.O_BINARY)
    msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
x = repr(open('c:/test/spanish.txt','rb').read())
print '<p>',x,'# first print</p>'
x = open('c:/test/spanish.txt','rb').read()
print '<p>',x,'# second print</p>'
x = repr((open('c:/test/spanish.txt','rb').read()).decode('windows-1252'))
print '<p>',x,'# third print</p>'
print """
</body>
</html>
"""

The output of the program is this:

'a\xf1o\r\n' # first print
a?o # second print #### Note that there is no ñ between the a and the o, only a box
u'a\xf1o\r\n' # third print

So what do you advise?

Subject: Help needed with python unicode cgi-bin script
From: weheh
Date: 12/12/2007 12:03:36 AM
> What does the browser say what the encoding of the page is?
>
> What browser are you using, and did you configure it to default to
> UTF-8 for all pages? (which you should not have done)
>

Browser is both IE and Firefox. IE is defaulting to UTF8. If I force it to "Encoding > Western European (Windows)" it shows the ñ. The browser encoding "Autoselect" feature is enabled, yet it always seems to default to UTF8. Any idea how to change that? Is there something I can put in html that forces it to do that?

I'm using Apache and have the following line in my httpd.conf file:

AddDefaultCharset utf-8

Is this a problem?

> Try "telnet server 80", then type
>
> GET /path HTTP/1.1<enter>
> Host: server<enter>
> <enter>
>
> and report what response from the server is (the complete one,
> not just the character in question)
>

OK, telnet session yields this:

HTTP/1.1 200 OK
Date: Tue, 11 Dec 2007 23:58:02 GMT
Server: Apache
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8

1f8

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
<html xmlns="http://www.w3.org/1999/xhtml" lang="en,sp,fr" xml:lang="en,sp,fr">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf8" />
<meta http-equiv="content-language" content="en,fr,sp" />
</head>
<body>
<p> a±o
 # first print</p>
<p> 'a\xf1o\r\n' # second print</p>
<p> 'a\xf1o\r\n' # third print</p>
<p> 'a\xf1o\r\n' # third print</p>
</body>
</html>

0

Connection to host lost.
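
The telnet capture above shows a Content-Type header declaring utf-8 while the body carries the single byte 0xF1. A minimal Python 3 sketch of that mismatch check follows. The helper names `declared_charset` and `body_matches_header` are hypothetical; the parsing uses the standard library's `email.message.Message`.

```python
from email.message import Message


def declared_charset(content_type_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type_value
    return msg.get_content_charset()


def body_matches_header(content_type_value, body):
    """True if the body bytes decode cleanly in the declared charset."""
    charset = declared_charset(content_type_value)
    if charset is None:
        return True  # nothing declared, so nothing to contradict
    try:
        body.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


body = b"<p> a\xf1o # first print</p>"
print(declared_charset("text/html; charset=utf-8"))
print(body_matches_header("text/html; charset=utf-8", body))
print(body_matches_header("text/html; charset=windows-1252", body))
```

The sketch only tests whether the bytes are decodable in the declared charset; it does not handle an unknown charset name, which would raise LookupError.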

Subject: Help needed with python unicode cgi-bin script
From: weheh
Date: 12/12/2007 12:06:09 AM
p.s. I modified the code to break things out more explicitly:

#!C:/Program Files/Python23/python.exe
import cgi, cgitb
import sys, codecs
import os,msvcrt
cgitb.enable()
print u"""Content-Type: text/html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
<html xmlns="http://www.w3.org/1999/xhtml" lang="en,sp,fr" xml:lang="en,sp,fr">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf8" />
<meta http-equiv="content-language" content="en,fr,sp" />
</head>
<body>
"""
if sys.platform == 'win32':
    msvcrt.setmode(sys.stdin.fileno(), os.O_BINARY)
    msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)

x = open('c:/test/spanish.txt','rb').read()
print '<p>',x,'# first print</p>'

x = open('c:/test/spanish.txt','rb').read()
x = repr(x)
print '<p>',x,'# second print</p>'

x = open('c:/test/spanish.txt','rb').read()
x = repr(x)
x = x.decode('windows-1252')
print '<p>',x,'# third print</p>'

x = open('c:/test/spanish.txt','rb').read()
x = repr(x)
x = x.decode('windows-1252')
x = x.encode('utf8')
print '<p>',x,'# third print</p>'

print """
</body>
</html>
"""

(The last print should read "fourth print")
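
In the listing above, repr() is applied before .decode(), so the third and fourth prints transcode the escaped ASCII text 'a\xf1o\r\n' rather than the file's bytes. A sketch of the other order, in Python 3 and with a hypothetical helper name `transcode`, assuming the file really is windows-1252:

```python
def transcode(raw: bytes, source: str = "windows-1252",
              target: str = "utf-8") -> bytes:
    """Decode raw bytes from `source`, then re-encode them as `target`."""
    text = raw.decode(source)   # bytes -> text, using the file's encoding
    return text.encode(target)  # text -> bytes, using the page's encoding


raw = b"a\xf1o\r\n"            # bytes reported for spanish.txt in the thread
print(repr(transcode(raw)))    # repr() is applied last, for display only
```

With this order the ñ leaves the script as the two bytes 0xC3 0xB1, which is what a page served as utf-8 needs.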

Subject: Help needed with python unicode cgi-bin script
From: weheh
Date: 12/12/2007 5:21:06 AM
John and Martin,

Thanks for your help. However, I have identified the culprit to be with Apache and the command:

AddDefaultCharset utf-8

which forces my browser to utf-8 encoding. It looks like your suggestions to change charset were incorrect. My example works equally well with charset=utf8 as it does with charset=windows-1252.

Incidentally, next time, if you really want to be helpful, might I suggest you leave out the mocking. I could care less, myself, but someone else might have gotten their feelings hurt. And in the end, it doesn't make you look good.

Thanks again. Cheers.

Subject: Help needed with python unicode cgi-bin script
From: weheh
Date: 12/12/2007 5:12:31 PM
Hi Duncan, thanks for the reply.

> FWIW, the code you posted only ever attempted to set the character set
> encoding using an html meta tag which is the wrong place to set it. The
> encoding specified in the HTTP headers always takes precedence. This is why
> the default charset setting in Apache was the one which applied.
>
> What you should have been doing was setting the encoding in the
> content-type header. i.e. in this line of your code:
>
>    print u"""Content-Type: text/html
>
> You should have changed it to read:
>
>    Content-Type: text/html; charset=windows-1252
>
> but because you didn't Apache was quietly changing it to read:
>
>    Content-Type: text/html; charset=utf-8
>

Will this work under the following situation? Let's say the user is filling out a text field on a form on my website. The user has their browser encoding set to utf8. My website has charset=windows-1252 as you indicate above. Will I run into a conflict somewhere?
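
The advice quoted above comes down to keeping two things in step: the charset named in the Content-Type header and the encoding actually used for the body bytes. A minimal Python 3 sketch of that idea follows. The function name `build_cgi_response` is hypothetical, and this is an illustration rather than the script from the thread.

```python
def build_cgi_response(body_text: str, charset: str = "windows-1252") -> bytes:
    """Return header plus body, with the body encoded in the declared charset."""
    # The charset goes in the HTTP header, not only in a meta tag.
    header = "Content-Type: text/html; charset=%s\r\n\r\n" % charset
    # Encode the body with the same charset the header announces.
    return header.encode("ascii") + body_text.encode(charset)


payload = build_cgi_response("<p>a\u00f1o</p>\n")
print(repr(payload))

# In a real CGI script the bytes would be written unmodified, e.g.:
#   sys.stdout.buffer.write(payload)
```

Passing charset="utf-8" instead produces a utf-8 header and utf-8 body bytes, so either choice works as long as both halves agree.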