offending filename should be logged on Malformed UTF-8 error #18

ericcj · 2011-01-27T18:09:34Z

also, this bug was closed for a bogus reason and should be reopened: https://github.com/protocool/AckMate/issues/closed#issue/13

0xradical · 2011-04-21T01:09:40Z

+1 for re-opening issue #13

Perl has different ways of treating utf8 stuff:
http://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8-vs.-UTF8

The ackmate_ack that is shipped with ackmate uses Encode::encode_utf8($buffer) instead of Encode::encode('UTF-8',$buffer). The previous malformed erros I was getting disappeared when I used the latter version of encode, although I don't know why (yet). I thought this could help others investigate this issue.

protocool · 2011-04-21T01:41:55Z

13 was closed because people were getting sick of getting emails about it.

As for using the latter version, I can't recall where I saw it, but I recall seeing a statement that encode_utf8 was the correct call. It's also odd that encode('UTF-8', $buffer) would result in fewer errors if it's more strict...

However, it's been a very long time since I burned any hours on this issue.

If you can come up with a few test files that prove it works better then I will gratefully apply any pull request.

Trev

0xradical · 2011-04-21T01:47:11Z

Hi Trev, I'll see if I can come up with something and possibly submit a pull request to your ack fork.

Best regards.

tedeh · 2011-05-16T12:40:56Z

+1 for this issue, I guess demanding a fix for the actual error might be too much but at least outputting the erroneous file name should be easy for anyone versed in Perl.

jjb · 2011-05-28T07:12:34Z

Alright folks, I've put together a solution for reporting offending files. Tell me what you think: https://github.com/protocool/AckMate/wiki/Unicode-UTF-8-error-message

obeattie · 2011-07-28T08:34:24Z

Would be really nice if this could be fixed. As far as I'm aware, all my files are encoded as UTF-8, but given that the kinds of directories I need to search over contain upward of 1,000 files, it's hard to track down the offending article. I've downgraded for now, but you know how we developers like to be on the bleeding edge :)

Even if someone could put together a script that will report all the "broken" files (or suggest one that already exists) and of course ideally fix them (wishful thinking I know!) that would be brilliant too.

artzte · 2011-08-09T23:11:55Z

I added a similar patch in another location of ackmate_ack: https://gist.github.com/1135457

airways · 2011-12-11T19:36:33Z

Addd a note to #13 just because it's the first thing you find in google, but I thought I should comment here:

Adding these lines to the end of needs_line_scan fixed this issue completely for me:

use Encode;
$buffer =  encode("UTF-8", $buffer);
return $buffer =~ /$regex/m;

Granted, there may be better places to do this encoding, but I just needed a quick fix. Hopefully this helps someone else out if they find the tickets.

joshgoebel · 2011-12-15T15:00:49Z

At the end of my needs_line_scan method (around line 2597) I changed the end of the method to:

    my $regex = $opt->{regex};
    if ( ! utf8::valid($buffer) ) {
            $buffer = Encode::encode_utf8($buffer);
    }
        return $buffer =~ /$regex/m;

And you have to add 'use utf8;' to the beginning of the script... does this seem reasonable?

protocool · 2011-12-15T21:05:47Z

Guys,

thanks for taking the lead on sorting this stuff out.

I've seen someone mention that setting export LC_CTYPE=en_US.UTF-8 in the environment helps.

I'd like to experiment with the various options before merging into AM proper, and at the same time updating AM's version of ack to the latest.

Can any of you point me to files which cause the UTF explosions so I can test against it?

Thanks, Trev

airways · 2011-12-15T21:18:32Z

I attempted to add logging as recommended by @jjb above, but the log items did not show up in the output. I can play with it a bit later unless @yyyc514 gets to it first. :)

snowblink · 2012-01-11T15:01:23Z

@yyyc514 solution works for me

@protocool I was getting stung on language files in ext.js

trisweb · 2012-02-15T18:54:47Z

No errors should be logged because AFAIK they aren't errors, they're files not in UTF-8 format being interpreted as if they are.

Here's a patch for the ackmate_ack to solve the issue - read the file with :raw format first, then convert to UTF-8 using the Encode lib. Does not complain. It also uses @protocool's (or @airways ?) encode strategy for the other possible error section, with which I had no issues.

I have not yet submitted this to the ack project because I am unsure if it's universally applicable, but it should work for our purposes.

To apply, first paste into an ackmate_ack.patch somewhere, then:

cd ~/Library/Application Support/TextMate/Plugins/AckMate.tmplugin/Contents/Resources
cp [path/to/ackmate_ack.patch] .
patch ackmate_ack < ackmate_ack.patch

--- ackmate_ack 2012-02-15 13:46:32.000000000 -0500
+++ ackmate_ack 2012-02-15 13:59:25.000000000 -0500
@@ -1553,12 +1553,15 @@

     # If there's no extension, or we don't recognize it, check the shebang line
     my $fh;
-    if ( !open( $fh, '<', $filename ) ) {
+    if ( !open( $fh, '<:raw', $filename ) ) {
         App::Ack::warn( "$filename: $!" );
         return;
     }
     my $header = <$fh>;
     close $fh;
+   use Encode;
+   $header = encode("UTF-8", $header);

     if ( $header =~ /^#!/ ) {
         return ($1,TEXT)       if $header =~ /\b(ruby|p(?:erl|hp|ython))\b/;
@@ -2607,10 +2610,14 @@
         App::Ack::warn( "$self->{filename}: $!" );
         return 1;
     }
+
     return 0 unless $rc && ( $rc == $size || length(Encode::encode_utf8($buffer)) == $size );

     my $regex = $opt->{regex};
-    return $buffer =~ /$regex/m;
+
+   use Encode;
+   $buffer = encode("UTF-8", $buffer);
+   return $buffer =~ /$regex/m;
 }

airways · 2012-02-15T19:09:00Z

This is basically the same as my patch for the malformed UTF-8 errors, which I posted on #13. I feel like this ticket should just be closed and the patch should be applied. No one has pointed out any issues with it, so I'm not sure why this hasn't been resolved already.

I agree that the request in this ticket seems pointless, we don't need logging for the error - we just need it to go away since ack and ackmate shouldn't tell us what format our files are supposed to be in.

trisweb · 2012-02-15T19:12:47Z

Correct. I've forked the project and attempted to apply, but the problem, you see, is that ackmate_ack is actually generated from http://github.com/petdance/ack - see: https://github.com/trisweb/AckMate/blob/master/bundle_extras/ackmate_ack.autogenerated

So it's not exactly simple.

mergulhao · 2012-03-08T23:38:53Z

@protocool I wget this file http://www.fiscontex.com.br/legislacao/ICMS/aliquotainternaicms.htm and the problem happens with it. This file seens to be written on Windows FrontPage and have weird windows encoding chars. Hope it helps.

normann · 2012-07-31T04:39:52Z

@trisweb 's patch totally work in my case.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

offending filename should be logged on Malformed UTF-8 error #18

offending filename should be logged on Malformed UTF-8 error #18

ericcj commented Jan 27, 2011

0xradical commented Apr 21, 2011

protocool commented Apr 21, 2011

0xradical commented Apr 21, 2011

tedeh commented May 16, 2011

jjb commented May 28, 2011

obeattie commented Jul 28, 2011

artzte commented Aug 9, 2011

airways commented Dec 11, 2011

joshgoebel commented Dec 15, 2011

protocool commented Dec 15, 2011

airways commented Dec 15, 2011

snowblink commented Jan 11, 2012

trisweb commented Feb 15, 2012

airways commented Feb 15, 2012

trisweb commented Feb 15, 2012

mergulhao commented Mar 8, 2012

normann commented Jul 31, 2012

offending filename should be logged on Malformed UTF-8 error #18

offending filename should be logged on Malformed UTF-8 error #18

Comments

ericcj commented Jan 27, 2011

0xradical commented Apr 21, 2011

protocool commented Apr 21, 2011

0xradical commented Apr 21, 2011

tedeh commented May 16, 2011

jjb commented May 28, 2011

obeattie commented Jul 28, 2011

artzte commented Aug 9, 2011

airways commented Dec 11, 2011

joshgoebel commented Dec 15, 2011

protocool commented Dec 15, 2011

airways commented Dec 15, 2011

snowblink commented Jan 11, 2012

trisweb commented Feb 15, 2012

airways commented Feb 15, 2012

trisweb commented Feb 15, 2012

mergulhao commented Mar 8, 2012

normann commented Jul 31, 2012