Re: comparing utf8s


From: Godmar Back (gback@cs.utah.edu)
Date: Mon Dec 21 1998 - 00:34:14 EST


>
>
> Also, section 4.4.7 of the VM spec says:
>
> There are two differences between this format and the "standard"
> UTF-8 format. First, the null byte (byte)0 is encoded using the
> two-byte format rather than the one-byte format, so that Java Virtual
> Machine UTF-8 strings never have embedded nulls. Second, only the
> one-byte, two-byte, and three-byte formats are used. The Java Virtual
> Machine does not recognize the longer UTF-8 formats.
>
>
> The way I read this is that Java's Utf8s are meant to be compared with
> strcmp. The GET macro should not be necessary.
>

I meant the use of the GET macro in the method that compares them here.

Of course, we could not use strcmp if there were alternate encodings
for the same unicode character. From my reading of section 4.4.7, this
does not seem to be the case.
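
To make the strcmp argument concrete, here is a rough sketch of an encoder
for the format described in 4.4.7, plus the comparison it permits. The
names jvm_utf8_encode and jvm_utf8_equal are made up for illustration and
are not Kaffe functions. The sketch assumes the encoder always emits the
shortest form for everything except the null character, which is what makes
the encoding unique per character.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: encode n UTF-16 code units into the JVM's UTF-8
 * variant (VM spec 4.4.7). `out` must have room for 3*n + 1 bytes.
 * Returns the number of bytes written, not counting the trailing NUL. */
static size_t jvm_utf8_encode(const unsigned short *in, size_t n, char *out)
{
    size_t k = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned int c = in[i] & 0xFFFFu;

        if (c >= 0x0001 && c <= 0x007F) {
            /* one-byte form */
            out[k++] = (char)c;
        } else if (c <= 0x07FF) {
            /* two-byte form; also used for c == 0, giving 0xC0 0x80 */
            out[k++] = (char)(0xC0 | (c >> 6));
            out[k++] = (char)(0x80 | (c & 0x3F));
        } else {
            /* three-byte form */
            out[k++] = (char)(0xE0 | (c >> 12));
            out[k++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[k++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[k] = '\0';   /* safe terminator: no encoded byte is ever zero */
    return k;
}

/* With one encoding per character and no embedded zero bytes,
 * equality of two encoded strings reduces to strcmp. */
static int jvm_utf8_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Since an embedded null comes out as the two bytes 0xC0 0x80, strlen and
strcmp see the whole string. This only holds as long as nobody hands us a
string that uses a longer-than-necessary form for some character.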

I looked at the fake utf8 stuff and think I know what it does now.
It's not exactly an elegant thing, but I guess it's okay.
The underlying reason seems to be polymorphism, or rather the lack thereof.
What you really want is utf8const.equals(utf8const) and
utf8const.equals(char*) and have the hashtable automatically pick
which one you mean. That's the price you pay for wanting to use
a parametrized hashtable in a language that does not offer polymorphism.

One possibility would be to sacrifice a word and store a pointer to
the char* at the head of the utf8const. Then the fake and the other
object would be indistinguishable. I suppose that would be a
space/time trade-off.
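
Roughly what I have in mind for the pointer-at-the-head idea, as a sketch
only. The struct layout, the field names, and the functions utf8_new,
utf8_init_fake, utf8_equal, and utf8_hash are all invented here and do not
reflect the actual utf8const code. The hash function is an arbitrary
placeholder.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: one extra word that points at the bytes,
 * wherever they happen to live. */
typedef struct Utf8Const {
    const char  *data;   /* the sacrificed word */
    unsigned int hash;
    int          nrefs;
    /* for a real constant, the bytes follow this header directly */
} Utf8Const;

/* Placeholder hash, not the one the VM uses. */
static unsigned int utf8_hash(const char *s)
{
    unsigned int h = 0;
    while (*s != '\0')
        h = 31u * h + (unsigned char)*s++;
    return h;
}

/* Real constant: header and bytes in a single allocation. */
static Utf8Const *utf8_new(const char *s)
{
    size_t len = strlen(s);
    Utf8Const *u = malloc(sizeof *u + len + 1);
    char *bytes;

    if (u == NULL)
        return NULL;
    bytes = (char *)(u + 1);
    memcpy(bytes, s, len + 1);
    u->data = bytes;
    u->hash = utf8_hash(s);
    u->nrefs = 1;
    return u;
}

/* Fake constant: a temporary lookup key that borrows the caller's char*. */
static void utf8_init_fake(Utf8Const *fake, const char *s)
{
    fake->data = s;
    fake->hash = utf8_hash(s);
    fake->nrefs = 0;
}

/* A single comparison callback for the hashtable; it cannot tell
 * a fake from a real constant and does not need to. */
static int utf8_equal(const void *a, const void *b)
{
    const Utf8Const *x = a;
    const Utf8Const *y = b;
    return x->hash == y->hash && strcmp(x->data, y->data) == 0;
}
```

The cost is one pointer per interned constant and one extra indirection
per comparison. What you get back is a hashtable that needs only one
equality function.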

One other thing I'd do is to compute the hash value for a faked utf8 up
front, since you have to do it anyway.
Then, utf8ConstHashValue(u) would simply be written as return u->hash;
There's no need to compute the hash lazily anymore. (I think this only
made sense to do lazily if utf8's were not interned.)
Then you could put utf8ConstHashValue as a static inline in the header
file.
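
As a sketch of the eager version: compute the hash when the object is set
up (fake or not), and the accessor becomes a one-liner fit for the header.
The struct fields, utf8ConstInit, and utf8_compute_hash are assumptions for
illustration. Only the name utf8ConstHashValue comes from the existing
code, and its real signature may differ.

```c
#include <stdint.h>

/* Hypothetical, simplified layout for illustration. */
typedef struct Utf8Const {
    uint32_t    hash;   /* always valid once the object is initialized */
    const char *data;
} Utf8Const;

/* Placeholder hash function; the real one may differ. */
static uint32_t utf8_compute_hash(const char *s)
{
    uint32_t h = 0;
    while (*s != '\0')
        h = 31u * h + (unsigned char)*s++;
    return h;
}

/* Hypothetical initializer: the hash is computed up front,
 * for real and faked constants alike. */
static void utf8ConstInit(Utf8Const *u, const char *s)
{
    u->data = s;
    u->hash = utf8_compute_hash(s);
}

/* No lazy check needed anymore, so this can live in the header. */
static inline uint32_t utf8ConstHashValue(const Utf8Const *u)
{
    return u->hash;
}
```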

        - Godmar



This archive was generated by hypermail 2b29 : Sat Sep 23 2000 - 19:57:26 EDT