Subject: BUG #3730: Creating a swedish dictionary fails
From: penty.wenngren@dgc.se ("Penty Wenngren")
Date: 11/8/2007 12:03:26 PM
The following bug has been logged online:
Bug reference: 3730
Logged by: Penty Wenngren
Email address: penty.wenngren@dgc.se
PostgreSQL version: 8.3 BETA 2
Operating system: FreeBSD 7 BETA 2
Description: Creating a swedish dictionary fails
Details:
I'm trying to create a Swedish dictionary for tsearch as specified in the
8.3 manual, but it breaks:
test=# CREATE TEXT SEARCH DICTIONARY swedish_ispell (
test(# TEMPLATE = ispell,
test(# DictFile = swedish,
test(# AffFile = swedish,
test(# StopWords = swedish);
ERROR: syntax error at line 219 of affix file "/usr/local/share/postgresql/tsearch_data/swedish.affix"
picard# pwd
/usr/local/share/postgresql/tsearch_data
picard# file swedish.*
swedish.affix: UTF-8 Unicode text
swedish.dict: UTF-8 Unicode text
swedish.stop: UTF-8 Unicode text
Line 219 in the affix file looks like this:
O T > -OT,\xc3\x96TTER
Please forgive me if this is a known problem with a known solution. I
haven't been able to find any information about this particular issue with
Swedish dictionaries.
// Penty
---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly
Subject: BUG #3730: Creating a swedish dictionary fails
From: tgl@sss.pgh.pa.us (Tom Lane)
Date: 11/8/2007 11:42:55 AM
"Penty Wenngren" <penty.wenngren@dgc.se> writes:
> I'm trying to create a swedish dictionary for tsearch as specified in the
> 8.3 manual, but it breaks:
Can you point us to copies of the swedish files you used?
regards, tom lane
Subject: BUG #3730: Creating a swedish dictionary fails
From: tgl@sss.pgh.pa.us (Tom Lane)
Date: 11/8/2007 5:21:17 PM
Penty Wenngren <penty.wenngren@dgc.se> writes:
> I used iconv to convert svenska.aff and svenska.datalist (from
> iswedish-1.2.1) to UTF-8. The converted files can be found at:
> http://www.lederhosen.org/swedish.affix
> http://www.lederhosen.org/swedish.dict
I think the reason it's failing right there is that that line is the
first affix rule containing a non-ASCII letter, and the rules are
supposed to only contain letters and certain specific punctuation.
I suspect you are working in a locale that doesn't think Ö is a
letter --- check lc_ctype.
regards, tom lane
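
To illustrate the locale dependence raised above, here is a minimal C sketch, assuming a POSIX-like system. It is not PostgreSQL code: the helper name is_letter_in_locale is hypothetical, and it only shows that the answer from iswalpha() is tied to the active LC_CTYPE setting.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <wctype.h>

/*
 * Hypothetical helper (not part of PostgreSQL): report whether wc counts
 * as a letter under the LC_CTYPE locale called locale_name.
 * Returns 1 or 0, or -1 if setlocale() rejects that locale name.
 */
int
is_letter_in_locale(const char *locale_name, wint_t wc)
{
    char        saved[256];
    const char *current;
    int         result;

    assert(locale_name != NULL);

    /* Remember the current LC_CTYPE so it can be restored afterwards. */
    current = setlocale(LC_CTYPE, NULL);
    snprintf(saved, sizeof saved, "%s", current != NULL ? current : "C");

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    /* iswalpha() consults the locale that was just selected. */
    result = iswalpha(wc) != 0;

    setlocale(LC_CTYPE, saved);
    return result;
}
```

Calling is_letter_in_locale("C", 0x00D6) and is_letter_in_locale("sv_SE.UTF-8", 0x00D6) may give different answers for Ö. This assumes that wchar_t holds Unicode code points and that a locale with that name is installed, and the outcome varies by platform.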
Subject: BUG #3730: Creating a swedish dictionary fails
From: tgl@sss.pgh.pa.us (Tom Lane)
Date: 11/8/2007 8:45:32 PM
Penty Wenngren <penty.wenngren@dgc.se> writes:
> On Thu, Nov 08, 2007 at 05:21:17PM -0500, Tom Lane wrote:
>> I suspect you are working in a locale that doesn't think Ö is a
>> letter --- check lc_ctype.
> It doesn't seem to make any difference. The first try was done from a
> terminal that didn't care much for UTF-8, but that is fixed now and I
> still get the same result.
It sorta looks to me like you only changed the locale of your terminal
session. Changing the database's locale requires re-initdb. What
does "show lc_ctype" show within Postgres?
regards, tom lane
Subject: BUG #3730: Creating a swedish dictionary fails
From: tgl@sss.pgh.pa.us (Tom Lane)
Date: 11/9/2007 1:49:27 PM
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I am wondering if the newline being included in the token could be
> causing a problem.
Nope. I traced through it and the problem is that char2wchar() is
completely brain-dead: at some places it thinks that "len" is the
length of the output wchar array, and at others it thinks that "len"
is the number of bytes in the input. In particular, _t_isalpha()
fails completely for any multibyte character, because the pnstrdup
call truncates the character to 1 byte.
After looking at the callers I'm inclined to think that the only
safe way to implement this routine is to change its API to provide
both counts. Comments?
regards, tom lane
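
The idea of passing both counts can be sketched as follows. This is a hypothetical illustration, not the PostgreSQL implementation of char2wchar(): the name utf8_to_wchar_sketch is invented, the decoder is hand-written for UTF-8 only so that it does not depend on the process locale, it skips overlong-encoding checks, and it assumes wchar_t is wide enough to hold the decoded code point.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical sketch: convert at most fromlen bytes of UTF-8 in "from"
 * into "to", which has room for tolen wchar_t elements including the
 * terminating L'\0'.  Returns the number of wide characters written
 * (excluding the terminator), or (size_t) -1 if the input is malformed
 * or ends in the middle of a multibyte character.  Conversion stops
 * early, without error, once the output buffer is full.
 */
size_t
utf8_to_wchar_sketch(wchar_t *to, size_t tolen,
                     const char *from, size_t fromlen)
{
    size_t in = 0;
    size_t out = 0;

    assert(to != NULL && from != NULL && tolen > 0);

    while (in < fromlen && out + 1 < tolen)
    {
        unsigned char c = (unsigned char) from[in];
        unsigned long cp;
        size_t        need;
        size_t        k;

        /* The lead byte tells us how many bytes this character needs. */
        if (c < 0x80)                { cp = c;        need = 1; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; need = 2; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; need = 3; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; need = 4; }
        else
            return (size_t) -1;     /* not a valid lead byte */

        /* The input byte count is checked separately from tolen. */
        if (fromlen - in < need)
            return (size_t) -1;     /* character cut off by fromlen */

        for (k = 1; k < need; k++)
        {
            unsigned char cc = (unsigned char) from[in + k];

            if ((cc & 0xC0) != 0x80)
                return (size_t) -1; /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        to[out++] = (wchar_t) cp;
        in += need;
    }

    to[out] = L'\0';
    return out;
}
```

With the two-byte sequence for Ö as input, passing fromlen = 2 yields one wide character, while passing fromlen = 1 reports an error. That mirrors the failure described above, where the character was truncated to 1 byte before conversion.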
Subject: BUG #3730: Creating a swedish dictionary fails
From: tgl@sss.pgh.pa.us (Tom Lane)
Date: 11/9/2007 5:39:58 PM
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> After looking at the callers I'm inclined to think that the only
>> safe way to implement this routine is to change its API to provide
>> both counts. Comments?
> +1
I've cleaned this up along with a fair amount of other infelicity in
ts_locale.h/.c. However, I'm not in a position to test the Windows-
specific bits in wchar2char() and char2wchar() --- could someone
eyeball and/or test what I did?
regards, tom lane