Blog coding and discussion of coding about JavaScript, PHP, CGI, general web building etc.

Thursday, February 18, 2016

Perl File Test for Text -T and PDFs

I'm trying to limit my Perl/Tk code so that it only opens text files for editing. I test that the user selected a valid file (I'm using Tk's getOpenFile()):

if ( (defined $file) and (-f $file) and (-T $file) ) {
    # work with file
}

The problem I've run into is that some PDF files pass the -T test and get opened (causing much chaos). I tried this code in a directory full of PDFs:

#!/usr/bin/perl

use strict;
use warnings;

my @files = <*>;
foreach (@files) {
    if (-T) { print "$_ is a text file\n" }
}

About half of the PDFs in the directory get printed.

Am I using -T wrong? Will I have to add a regex to filter out PDFs? And how come Perl thinks only some of the PDFs are text?

EDIT: -T is a file test that should return true if the file is plain text. I'm not trying to check for taint.

Answer by Øyvind Skaar for Perl File Test for Text -T and PDFs


A couple of suggestions:

  • Have you tried a newer Perl? The docs call -T a "heuristic guess"; maybe it has been improved.
  • Kind of a hack, but you could try running 'file' on the files before opening them.
  • Another hack: read the first line after open() to see if it really is text.

I don't know why it fails, though. Do you have a publicly accessible PDF file that passes -T?
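The third suggestion (reading the start of the file yourself) could be sketched roughly as below. This is only an approximation of the idea behind -T, not Perl's own implementation; the helper name looks_like_text, the 512-byte sample and the 10% tolerance are all assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: guess whether a file is text by inspecting its first
# bytes. It only approximates the idea behind -T; it is not Perl's own code.
sub looks_like_text {
    my ($path, $sample_size, $max_ratio) = @_;
    $sample_size //= 512;     # assumed sample size
    $max_ratio   //= 0.10;    # assumed tolerance for odd bytes

    open my $fh, '<:raw', $path or return;
    my $got = read $fh, my $buf, $sample_size;
    close $fh;
    return unless defined $got;    # read error
    return 1 if $got == 0;         # treat an empty file as text

    return 0 if $buf =~ /\0/;      # a NUL byte almost always means binary

    # Count bytes that are neither printable ASCII nor common whitespace.
    my $odd = () = $buf =~ /[^\x20-\x7E\t\n\r\f]/g;
    return $odd / length($buf) <= $max_ratio ? 1 : 0;
}

# Report on any files named on the command line.
print "$_: ", (looks_like_text($_) ? "text" : "not text"), "\n" for @ARGV;
```

A byte-counting check like this would still accept an uncompressed PDF, and it would reject text in encodings that use many bytes above 127, so it is a complement to a format-specific check rather than a replacement for one.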

Answer by matthias krull for Perl File Test for Text -T and PDFs


You may have more success with the File::Type or File::LibMagic modules.

PDF is mostly plain text. Compression, images and encryption make many PDFs appear binary, but simple PDFs look like plain text to naive tests.

A simplified version of the minimal PDF from the spec is plain text:

%PDF-1.1
%í??"

1 0 obj
  << /Type /Catalog
     /Pages 2 0 R
  >>
endobj

2 0 obj
  << /Type /Pages
     /Kids [3 0 R]
     /Count 1
     /MediaBox [0 0 300 144]
  >>
endobj

3 0 obj
  <<  /Type /Page
      /Parent 2 0 R
      /Resources
       << /Font
           << /F1
               << /Type /Font
                  /Subtype /Type1
                  /BaseFont /Times-Roman
               >>
           >>
       >>
      /Contents [
        << /Length 105 >>
        stream
          BT
            /F1 18 Tf
            0 0 Td
            (Hello world.) Tj
          ET
        endstream ]
  >>
endobj

xref
0 4
0000000000 65535 f
0000000019 00000 n
0000000078 00000 n
0000000179 00000 n
trailer
  <<  /Root 1 0 R
      /Size 4
  >>
startxref
612
%%EOF

Answer by Borodin for Perl File Test for Text -T and PDFs


You are using -T correctly: it is just a best guess rather than an absolute classification. It may help to know that PDF files start with the four-character signature (fourcc) %PDF, which you can check easily with a subroutine like this:

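sub isPDF {
    # Open the file named by the first argument; return false if that fails.
    open my $fh, '<', shift or return;
    # Read the first four bytes and compare them with the PDF signature.
    read $fh, my $fourcc, 4;
    return $fourcc eq '%PDF';
}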
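To tie this back to the check in the question, here is one way the signature test might be combined with -f and -T. It is a sketch rather than a definitive fix: is_pdf is a lightly hardened variant of the subroutine above, and is_editable_text is a hypothetical name for the gatekeeper the editor would call.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True if the file starts with the four-byte PDF signature.
sub is_pdf {
    my ($path) = @_;
    open my $fh, '<:raw', $path or return;
    my $got = read $fh, my $magic, 4;
    close $fh;
    return defined $got && $got == 4 && $magic eq '%PDF';
}

# Hypothetical gatekeeper for the editor: a plain file that -T accepts
# and that does not carry the PDF signature.
sub is_editable_text {
    my ($file) = @_;
    return defined $file && -f $file && -T $file && !is_pdf($file);
}

# Report on any files named on the command line.
for my $file (@ARGV) {
    print "$file: ", (is_editable_text($file) ? "open it" : "skip it"), "\n";
}
```

In the Tk code, the value returned by getOpenFile() would be passed to is_editable_text() in place of the original three-part condition. Other formats that happen to pass -T would need their own signature checks.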

Answer by tuxuday for Perl File Test for Text -T and PDFs


As Øyvind Skaar pointed out, try the 'file' command.
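Calling 'file' from Perl might look something like the sketch below. It assumes a Unix-like system whose file(1) supports the --brief and --mime-type options; the helper name mime_type_via_file is hypothetical, and it returns undef when the command is missing or fails.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: ask the external file(1) utility for a MIME type.
# Returns undef if the command is missing or fails.
sub mime_type_via_file {
    my ($path) = @_;
    # List-form pipe open avoids passing the file name through a shell.
    open my $pipe, '-|', 'file', '--brief', '--mime-type', '--', $path
        or return;
    my $type = <$pipe>;
    close $pipe or return;
    return unless defined $type;
    chomp $type;
    return $type;
}

# Report on any files named on the command line.
for my $file (@ARGV) {
    my $type = mime_type_via_file($file) // 'unknown';
    print "$file: $type\n";
}
```

An editor could then accept only files whose reported type starts with text/, at the cost of spawning an external process for every file.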

Answer by Chris Dolan for Perl File Test for Text -T and PDFs


Most PDFs have a few binary characters right after the %PDF on purpose to hint that it's not (entirely) a plain text file. The PDF spec even recommends it:

Note: If a PDF file contains binary data, as most do (see Section 3.1, "Lexical Conventions"), it is recommended that the header line be immediately followed by a comment line containing at least four binary characters, that is, characters whose codes are 128 or greater. This will ensure proper behavior of file transfer applications that inspect data near the beginning of a file to determine whether to treat the file's contents as text or as binary.

In matthias krull's answer above, you can see the %í??" comment line that attempts to trigger this.
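As an illustration of that recommendation, the following sketch checks whether the line after the %PDF header is a comment containing at least four bytes with codes of 128 or greater. The helper name has_binary_marker and the 1024-byte read size are assumptions, and because the marker is only recommended, a PDF without it is still a valid PDF.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: does the line after the %PDF header look like the
# recommended comment with at least four bytes >= 128?
sub has_binary_marker {
    my ($path) = @_;
    open my $fh, '<:raw', $path or return;
    my $got = read $fh, my $head, 1024;    # assumed to cover both lines
    close $fh;
    return 0 unless $got;

    # Header line, a line break (CR, LF or CRLF), then a comment line.
    return 0 unless $head =~ /\A%PDF-[^\r\n]*[\r\n]+%([^\r\n]*)/;
    my $comment = $1;

    # Count the high-bit bytes in the comment.
    my $high = () = $comment =~ /[\x80-\xFF]/g;
    return $high >= 4 ? 1 : 0;
}

# Report on any files named on the command line.
for my $file (@ARGV) {
    print "$file: ", (has_binary_marker($file) ? "has" : "lacks"),
        " the binary marker\n";
}
```

Four high-bit bytes in the first block need not be enough to tip a heuristic like -T towards "binary", which is consistent with some PDFs still passing the test.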

