

...making Linux just a little more fun!
Jimmy O'Regan [joregan at gmail.com]
Freelang has a lot of (usually small) dictionaries, for Windows. They have quite a few languages that aren't easy to find dictionaries for, so though the coverage and quality are usually quite low, they're sometimes all that's there.
So, an example: http://www.freelang.net/dictionary/albanian.php
Leads to a file, dic_albanian.exe
This runs quite well in Wine (I haven't found any other way of extracting the contents). On my system, the 'C:\users\jim\Local Settings\Application Data\Freelang Dictionary' translates to '~/.wine/drive_c/users/jim/Local\ Settings/Application\ Data/Freelang\ Dictionary/'. The dictionary files are inside the 'language' directory.
Saving this as wb2dict.c:
#include <stdlib.h>
#include <stdio.h>

int main (int argc, char** argv)
{
  char src[31];
  char trg[53];
  FILE* f=fopen(argv[1], "r");
  if (f==NULL) {
    fprintf (stderr, "Error reading file: %s\n", argv[1]);
    exit(1);
  }
  while (!feof(f)) {
    fread(&src, sizeof(char), 31, f);
    fread(&trg, sizeof(char), 53, f);
    printf ("%s\n %s\n\n", src, trg);
  }
  fclose(f);
  exit(0);
}
The next step depends on the contents... Albanian on Windows uses Codepage 1250, so in this case:
./wb2dict Albanian_English.wb|recode 'windows1250..utf8' |dictfmt -f --utf8 albanian-english
dictzip albanian-english.dict
(as root)
cp albanian-english.* /usr/share/dictd/

add these lines to /var/lib/dictd/db.list:
database albanian-english
{
  data  /usr/share/dictd/albanian-english.dict.dz
  index /usr/share/dictd/albanian-english.index
}

/etc/init.d/dictd restart

and now it's available:
dict agim
1 definition found

From unknown [albanian-english]:

  agim
   dawn
-- <Leftmost> jimregan, that's because deep inside you, you are evil. <Leftmost> Also not-so-deep inside you.
Ben Okopnik [ben at linuxgazette.net]
On Sun, Sep 05, 2010 at 02:58:30PM +0100, Jimmy O'Regan wrote:
> Freelang has a lot of (usually small) dictionaries, for Windows. They
> have quite a few languages that aren't easy to find dictionaries for,
> so though the coverage and quality are usually quite low, they're
> sometimes all that's there.
>
> So, an example: http://www.freelang.net/dictionary/albanian.php
>
> Leads to a file, dic_albanian.exe
Sweet. Thanks, Jimmy - I can use that!
> This runs quite well in Wine (I haven't found any other way of
> extracting the contents). On my system, the 'C:\users\jim\Local
> Settings\Application Data\Freelang Dictionary' translates to
> '~/.wine/drive_c/users/jim/Local\ Settings/Application\ Data/Freelang\
> Dictionary/'. The dictionary files are inside the 'language'
> directory.
Oh, right - reminds me: for stuff like this, I've got a special directory I use so I don't have to hunt through the WINE structure. I created a symlink at ".wine/drive_c/temp/to_unix" that points to my /tmp directory, so if I just install the program to that directory, it shows up in my /tmp, all ready to be played with.
> Saving this as wb2dict.c:
[snip]
Whoops - that double-prints the last entry in the dictionary. Not a big deal, though.
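The double print comes from the `while (!feof(f))` loop: feof() only turns true after a read has already hit the end of the file, so the loop body runs once more than there are records, and the last printf re-emits whatever is still sitting in the buffers. A minimal sketch of the effect, assuming the 84-byte records (31 + 53) that wb2dict.c reads; the function names are made up for illustration:

```c
#include <assert.h>
#include <stdio.h>

#define RECORD_SIZE 84  /* 31-byte source field + 53-byte target field */

/* Counts loop iterations the way the original wb2dict.c loops:
 * feof() is tested before the read that would discover end-of-file. */
int count_with_feof(FILE *f)
{
    char rec[RECORD_SIZE];
    int n = 0;

    assert(f != NULL);
    while (!feof(f)) {
        size_t got = fread(rec, 1, RECORD_SIZE, f);
        (void)got;  /* result deliberately ignored, as in the original */
        n++;
    }
    return n;
}

/* Counts only the reads that actually returned a complete record. */
int count_with_fread(FILE *f)
{
    char rec[RECORD_SIZE];
    int n = 0;

    assert(f != NULL);
    while (fread(rec, 1, RECORD_SIZE, f) == RECORD_SIZE)
        n++;
    return n;
}
```

Run against a file holding exactly two records, count_with_feof() returns 3 and count_with_fread() returns 2, so looping on fread()'s return value is one way to avoid the extra pass.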
> The next step depends on the contents... Albanian on Windows uses
> Codepage 1250, so in this case:
>
> ./wb2dict Albanian_English.wb|recode 'windows1250..utf8' |dictfmt -f
> --utf8 albanian-english
> dictzip albanian-english.dict
Or, all of the above in one step:
#!/usr/bin/perl -w
# Created by Ben Okopnik on Sun Sep 5 12:11:02 EDT 2010
use strict;

die "Usage: ", $0 =~ /([^\/]+)$/, " <dict_file> [encoding]\n"
    unless @ARGV;

use open IN => ":encoding(" . (defined $ARGV[1]?$ARGV[1]:'utf8') . ")",
    OUT => ":utf8";

(my $dct = $ARGV[0]) =~ s/\.wb$//;
$dct =~ tr/_ A-Z/-_a-z/;
open my $in, $ARGV[0] or die "$ARGV[0]: $!\n";
open my $out, "|/usr/bin/dictfmt -f --utf8 $dct"
    or die "Pipe failure: $!\n";

{
    my $ret1 = read $in, my $src, 31;
    my $ret2 = read $in, my $tgt, 53;
    last unless $ret1 & $ret2;
    s/\0.*// for $src, $tgt;
    printf $out "%s\n %s\n\n", $src, $tgt;
    redo;
}
close $in;
close $out;    # let dictfmt finish writing before dictzip runs
system ('dictzip', "$dct.dict");

print <<"+EOT+"
database $dct.dict.dz
{
        data  /usr/share/dictd/$dct.dict.dz
        index /usr/share/dictd/$dct.index
}
+EOT+
Just specify the '.wb' file as the first argument and its encoding as the second.
> (as root)
> cp albanian-english.* /usr/share/dictd/
>
> add these lines to /var/lib/dictd/db.list :
> database albanian-english
> {
>   data  /usr/share/dictd/albanian-english.dict.dz
>   index /usr/share/dictd/albanian-english.index
> }
For convenience, the script actually spits that out so it can be copied
and pasted.
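For anyone who'd rather not decode the Perl, here is a rough C sketch of the two bookkeeping steps the script performs: turning the '.wb' file name into a database name (the `s/\.wb$//` and `tr/_ A-Z/-_a-z/` lines) and producing a db.list stanza. Both function names are hypothetical, and the stanza uses the bare database name as in Jimmy's original db.list entry, which is an assumption rather than the script's exact output:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Derives a dictd database name from a .wb file name:
 * strip a trailing ".wb", then '_' -> '-', ' ' -> '_', A-Z -> a-z.
 * Returns 0 on success, -1 if the result does not fit in out. */
int wb_database_name(const char *file, char *out, size_t outsize)
{
    size_t len, i;

    assert(file != NULL && out != NULL);
    len = strlen(file);
    if (len >= 3 && strcmp(file + len - 3, ".wb") == 0)
        len -= 3;
    if (len + 1 > outsize)
        return -1;

    for (i = 0; i < len; i++) {
        char c = file[i];
        if (c == '_')
            out[i] = '-';
        else if (c == ' ')
            out[i] = '_';
        else if (c >= 'A' && c <= 'Z')
            out[i] = (char)(c - 'A' + 'a');
        else
            out[i] = c;
    }
    out[len] = '\0';
    return 0;
}

/* Formats a db.list stanza for the given database name into out.
 * Returns the value of snprintf(): the length the full stanza needs. */
int wb_db_stanza(const char *name, char *out, size_t outsize)
{
    assert(name != NULL && out != NULL);
    return snprintf(out, outsize,
                    "database %s\n"
                    "{\n"
                    "  data  /usr/share/dictd/%s.dict.dz\n"
                    "  index /usr/share/dictd/%s.index\n"
                    "}\n",
                    name, name, name);
}
```

With "Albanian_English.wb" as input, wb_database_name() yields "albanian-english", matching the name used on the dictfmt command line earlier in the thread.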
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *
Jimmy O'Regan [joregan at gmail.com]
On 5 September 2010 17:30, Ben Okopnik <ben at linuxgazette.net> wrote:
> On Sun, Sep 05, 2010 at 02:58:30PM +0100, Jimmy O'Regan wrote:
>> Saving this as wb2dict.c:
>
> [snip]
>
> Whoops - that double-prints the last entry in the dictionary. Not a
> big deal, though.
Ah well... I spent more time on the dict stuff than looking at the raw files/writing the C.
It also loses the first entry (I think) because of the way dictfmt adds its initial entries.
> {
>     my $ret1 = read $in, my $src, 31;
>     my $ret2 = read $in, my $tgt, 53;
>     last unless $ret1 & $ret2;
>     s/\0.*// for $src, $tgt;

Not quite. The reason I used C was because the data showed some
evidence of C string reuse:
schmal(t)z\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
"devojka za s\0"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
factotum\0\0\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0

... so you'd at least need to split both strings on \0
Jimmy O'Regan [joregan at gmail.com]
On 5 September 2010 17:36, Jimmy O'Regan <joregan at gmail.com> wrote:
>>> Saving this as wb2dict.c:
>>
>> [snip]
>>
>> Whoops - that double-prints the last entry in the dictionary. Not a
>> big deal, though.
>
> Ah well... I spent more time on the dict stuff than looking at the raw
> files/writing the C
>
> It also loses the first entry (I think) because of the way dictfmt
> adds its initial entries.
This version fixes both problems:
#include <stdlib.h>
#include <stdio.h>

int main (int argc, char** argv)
{
  char src[31];
  char trg[53];
  int c;
  FILE* f=fopen(argv[1], "r");
  if (f==NULL) {
    fprintf (stderr, "Error reading file: %s\n", argv[1]);
    exit(1);
  }

  printf ("00-database-info\n Converted from %s\n\n", argv[1]);
  printf ("00-dummy-entry\n For dictfmt\n\n");

  while ((c = (int) fgetc(f)) != EOF) {
    ungetc(c, f);
    fread(&src, sizeof(char), 31, f);
    fread(&trg, sizeof(char), 53, f);
    printf ("%s\n %s\n\n", src, trg);
  }
  fclose(f);
  exit(0);
}
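One thing neither C version guards against is a short read, or a field that fills all 31 (or 53) bytes and so carries no terminating null for printf's %s to stop at. Whether Freelang files ever contain such a field is unknown, so treat the following as a defensive sketch only; wb_read_entry() is a hypothetical helper, and the field widths are taken from wb2dict.c:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SRC_LEN 31  /* width of the headword field, from wb2dict.c */
#define TRG_LEN 53  /* width of the translation field, from wb2dict.c */

/* Reads one record into src and trg, which must have room for
 * SRC_LEN + 1 and TRG_LEN + 1 bytes respectively. Returns 1 if a
 * complete record was read, 0 at end of file or on a short read. */
int wb_read_entry(FILE *f, char *src, char *trg)
{
    unsigned char rec[SRC_LEN + TRG_LEN];

    assert(f != NULL && src != NULL && trg != NULL);
    if (fread(rec, 1, sizeof rec, f) != sizeof rec)
        return 0;

    memcpy(src, rec, SRC_LEN);
    src[SRC_LEN] = '\0';   /* terminator even if the field is full */
    memcpy(trg, rec + SRC_LEN, TRG_LEN);
    trg[TRG_LEN] = '\0';
    return 1;
}
```

A conversion loop could then read `while (wb_read_entry(f, src, trg)) printf("%s\n %s\n\n", src, trg);`, with src and trg declared one byte larger than the fields and the file opened in binary mode ("rb").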
Ben Okopnik [ben at linuxgazette.net]
On Sun, Sep 05, 2010 at 05:36:45PM +0100, Jimmy O'Regan wrote:
> On 5 September 2010 17:30, Ben Okopnik <ben at linuxgazette.net> wrote:
>
> > {
> >     my $ret1 = read $in, my $src, 31;
> >     my $ret2 = read $in, my $tgt, 53;
> >     last unless $ret1 & $ret2;
> >     s/\0.*// for $src, $tgt;
>
> Not quite. The reason I used C was because the data showed some
> evidence of C string reuse:
> schmal(t)z\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
> "devojka za s\0"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
> factotum\0\0\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
>
> ... so you'd at least need to split both strings on \0
Actually, except for the double-printed entry, it produces precisely the same output as your program - so that seems to work just fine.
Ben Okopnik [ben at linuxgazette.net]
On Sun, Sep 05, 2010 at 05:36:45PM +0100, Jimmy O'Regan wrote:
> Not quite. The reason I used C was because the data showed some
> evidence of C string reuse:
> schmal(t)z\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
> "devojka za s\0"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
> factotum\0\0\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
>
> ... so you'd at least need to split both strings on \0
Just recalled: C strings are null-terminated, right? That means anything that reads the buffer as a string (printf's "%s", say) stops at that first null, regardless of the content after it. I'm just doing that manually.
#include <stdlib.h>
#include <stdio.h>

int main()
{
    char *str = "abc\0def";
    printf("%s\n", str);
    exit(0);
}
This will only print the first three characters of the string.
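The bytes after that first null are still in the buffer, which is how the leftover "ch" in Jimmy's "factotum" record could be seen in the raw file at all. A small sketch for spotting such leftovers in a fixed-width field (wb_leftover_offset() is a made-up helper, not something from either program):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the offset of the first non-null byte that follows the
 * first terminator in a fixed-width field, or -1 if everything after
 * the terminator is null (or the field has no terminator at all). */
long wb_leftover_offset(const char *field, size_t width)
{
    const char *nul;
    size_t i;

    assert(field != NULL);
    nul = memchr(field, '\0', width);
    if (nul == NULL)
        return -1;
    for (i = (size_t)(nul - field) + 1; i < width; i++)
        if (field[i] != '\0')
            return (long)i;
    return -1;
}
```

For a 31-byte field holding "factotum\0\0\0ch" followed by nulls, strlen() gives 8 and wb_leftover_offset() gives 11, the position of the stale "c".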
Jimmy O'Regan [joregan at gmail.com]
On 5 September 2010 18:13, Ben Okopnik <ben at linuxgazette.net> wrote:
> On Sun, Sep 05, 2010 at 05:36:45PM +0100, Jimmy O'Regan wrote:
>> On 5 September 2010 17:30, Ben Okopnik <ben at linuxgazette.net> wrote:
>>
>> > {
>> >     my $ret1 = read $in, my $src, 31;
>> >     my $ret2 = read $in, my $tgt, 53;
>> >     last unless $ret1 & $ret2;
>> >     s/\0.*// for $src, $tgt;
>>
>> Not quite. The reason I used C was because the data showed some
>> evidence of C string reuse:
>> schmal(t)z\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
>> "devojka za s\0"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
>> factotum\0\0\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
>>
>> ... so you'd at least need to split both strings on \0
>
> Actually, except for the double-printed entry, it produces precisely the
> same output as your program - so that seems to work just fine.
>
Sorry, misread "s/\0.*//". I need 1) new glasses, and 2) to clean my monitor.
Jimmy O'Regan [joregan at gmail.com]
On 5 September 2010 17:30, Ben Okopnik <ben at linuxgazette.net> wrote:
> #!/usr/bin/perl -w
> # Created by Ben Okopnik on Sun Sep 5 12:11:02 EDT 2010
> use strict;
>
> die "Usage: ", $0 =~ /([^\/]+)$/, " <dict_file> [encoding]\n"
>     unless @ARGV;
>
> use open IN => ":encoding(" . (defined $ARGV[1]?$ARGV[1]:'utf8') . ")",
>     OUT => ":utf8";
>
> (my $dct = $ARGV[0]) =~ s/\.wb$//;
> $dct =~ tr/_ A-Z/-_a-z/;
> open my $in, $ARGV[0] or die "$ARGV[0]: $!\n";
> open my $out, "|/usr/bin/dictfmt -f --utf8 $dct"
>     or die "Pipe failure: $!\n";
>
print $out "00-dummy-entry\n For dictfmt\n\n";
here will get rid of the second bug I had
Ben Okopnik [ben at linuxgazette.net]
On Sun, Sep 05, 2010 at 06:55:24PM +0100, Jimmy O'Regan wrote:
> print $out "00-dummy-entry\n For dictfmt\n\n";
>
> here will get rid of the second bug I had
OK, so the "improved" version looks like this (I was trying to remember what in Perl handles C strings... 'pack/unpack', of course):
#!/usr/bin/perl -w
# Created by Ben Okopnik on Sun Sep 5 12:11:02 EDT 2010
use strict;

die "Usage: ", $0 =~ /([^\/]+)$/, " <dict_file> [encoding]\n"
    unless @ARGV;

use open IN => ":encoding(" . (defined $ARGV[1]?$ARGV[1]:'utf8') . ")",
    OUT => ":utf8";

(my $dct = $ARGV[0]) =~ s/\.wb$//;
$dct =~ tr/_ A-Z/-_a-z/;
open my $in, $ARGV[0] or die "$ARGV[0]: $!\n";
open my $out, "|/usr/bin/dictfmt -f --utf8 $dct" or die "Pipe failure: $!\n";

my $src;
print $out "00-dummy-entry\n\tFor dictfmt\n\n";
printf "%s\n\t%s\n\n", unpack("Z31 Z53", $src) while read $in, $src, 84;

close $in;
close $out;    # let dictfmt finish writing before dictzip runs
system ('dictzip', "$dct.dict");

print <<"+EOT+"
database $dct.dict.dz
{
        data  /usr/share/dictd/$dct.dict.dz
        index /usr/share/dictd/$dct.index
}
+EOT+
The amusing part is the amount of work done by that "printf" line. Real
workhorse, that thing.
Ben Okopnik [ben at linuxgazette.net]
On Sun, Sep 05, 2010 at 02:28:53PM -0400, Benjamin Okopnik wrote:
Whoops, one mistake there:
> printf "%s\n\t%s\n\n", unpack("Z31 Z53", $src) while read $in, $src, 84;
Should be
printf $out "%s\n\t%s\n\n", unpack("Z31 Z53", $src) while read $in, $src, 84;