Further evidence strongly suggests that this was simply my mistake, caused by too many confusing factors at once. I will neuter the post for now, lest it become a public nuisance. If I am brave enough, I may explain how I was misled; otherwise I may eventually delete the post.
I haven't investigated this further, since there was an easy workaround in my case. I had thought that Python 3.4.0 might, in certain cases, return the byte offset instead of the character offset when you use re.search.
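For reference, here is a sketch of the standard behavior (not of whatever my code was actually doing): in Python 3, re.search on a str reports character offsets, while re.search on bytes reports byte offsets. Mixing the two is one plausible way to get confusing numbers.

```python
import re

s = "abc\u05d0\u05d1\u05d2def"       # "abc", three Hebrew letters, "def"

# Searching the str: offsets are character indices.
m = re.search("def", s)
print(m.start(), m.end())            # 6 9

# Searching the UTF-8 bytes: offsets are byte indices
# (each Hebrew letter is 2 bytes in UTF-8).
b = s.encode("utf-8")
m = re.search(b"def", b)
print(m.start(), m.end())            # 9 12
```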
Although I now think Python is fine, I can't be entirely sure. I had good reason to try the regex search: I needed something that could handle more complexity than str.find can.
In my case, I was working with Hebrew (right-to-left, multi-byte character encoding -- definitely not an easy combination). pdb showed len(line) as 78, yet m.start() was 102 and m.end() was 104. If those numbers really were byte offsets (which seems very possible), and if there were a way to use them as indices into the string, that might be much faster than returning a character index. If anyone has insight into how this "feature" might be used, or into when it does or doesn't produce the "wrong" value, I would be interested.
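I can't reconstruct exactly what my code did, but here is a sketch (with made-up data, not my original line) of how searching the UTF-8 bytes of a Hebrew string yields offsets larger than len(line), and of those offsets working as indices into the encoded bytes rather than the str:

```python
import re

# Hypothetical Hebrew line (not my original data).
line = "\u05e9\u05dc\u05d5\u05dd " * 5 + "xx"
data = line.encode("utf-8")           # each Hebrew letter becomes 2 bytes

m = re.search(b"xx", data)
print(len(line), m.start(), m.end())  # 27 45 47 -- offsets exceed the character length

# The offsets are valid indices, but into the bytes, not the str:
assert data[m.start():m.end()] == b"xx"
```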
My original thought was that there could somehow be a case where the unicode string needs to be represented in a fixed-width format (presumably 16 bits here), but through some oversight it is left in UTF-8 internally, and the search never expects to have to convert a byte count to a variable-width character count.
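If the offsets really were byte counts into a UTF-8 buffer, converting one back to a character count would just mean decoding the prefix. A minimal sketch, assuming the offset lands on a character boundary in valid UTF-8 (byte_to_char_offset is my own hypothetical helper, not anything in Python's re module):

```python
import re

def byte_to_char_offset(data: bytes, byte_off: int) -> int:
    """Map an offset into UTF-8 bytes to an index into the decoded str.

    Assumes byte_off falls on a character boundary; otherwise the
    decode raises UnicodeDecodeError.
    """
    return len(data[:byte_off].decode("utf-8"))

line = "\u05d0\u05d1\u05d2 abc"      # three Hebrew letters, a space, "abc"
data = line.encode("utf-8")
m = re.search(b"abc", data)          # byte offset: 3 letters * 2 bytes + 1 space = 7

start = byte_to_char_offset(data, m.start())
print(start)                         # 4 -- the character index of "abc"
assert line[start:start + 3] == "abc"
```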
It appears that I either looked at the wrong, similar data structure during debugging, or the code referenced the wrong one, or both.