Perl File Test for Text -T and PDFs
I'm trying to limit my Perl-Tk code to only open text files for editing. I'm testing to make sure the user selected a valid file (I'm using Tk's getOpenFile()):
    if ( (defined $file) and (-f $file) and (-T $file) ) {
        # work with file
    }
The problem I've run into is that some PDF files pass the -T test and get opened (causing much chaos). I tried this code in a directory full of PDFs:
    #!/usr/bin/perl
    use strict;
    use warnings;

    my @files = <*>;
    foreach (@files) {
        if (-T) { print "$_ is a text file\n" }
    }
About half of the PDFs in the directory get printed.
Am I using -T wrong? Will I have to add a regex to filter out PDFs? And how come Perl thinks only some of the PDFs are text?
EDIT: -T is a file test that should return true if the file is plain text. I'm not trying to check for taint.
Answer by Øyvind Skaar for Perl File Test for Text -T and PDFs
A couple of suggestions:
- Have you tried with a newer Perl? The docs call -T a "heuristic guess"; maybe they have improved it.
- Kind of a hack, but you could try running 'file' on the files before opening them.
- Another hack: read the first line after open() to see if it really is text.
I don't know why it fails, though. Do you have a publicly accessible PDF file that passes -T?
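The 'file' suggestion above could be sketched roughly as follows. This is only a sketch: it assumes a Unix-like system where the file(1) utility is on the PATH and accepts --brief and --mime-type, and the names mime_type_from_file_cmd and looks_like_text_to_file_cmd are made up for illustration.

```perl
use strict;
use warnings;

# Ask the external `file` utility for a MIME type.
# Returns undef if `file` cannot be run or prints nothing.
sub mime_type_from_file_cmd {
    my ($path) = @_;

    # List-form pipe open: the file name never passes through a shell.
    open(my $pipe, '-|', 'file', '--brief', '--mime-type', '--', $path)
        or return;
    my $type = <$pipe>;
    close $pipe;

    return unless defined $type;
    chomp $type;
    return $type;
}

# Hypothetical policy: only "text/..." types are offered to the editor.
sub looks_like_text_to_file_cmd {
    my ($path) = @_;
    my $type = mime_type_from_file_cmd($path);
    return 0 unless defined $type;
    return $type =~ m{^text/} ? 1 : 0;
}
```

A PDF would normally be reported with a non-text MIME type, so it should be rejected by this policy, but the exact strings depend on the installed magic database.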
Answer by matthias krull for Perl File Test for Text -T and PDFs
You may have more success with the File::Type or File::LibMagic modules.
A PDF is mostly plain text. Compression, images and encryption make PDF files appear binary, but simple PDFs look like plain text to naive tests.
The minimal PDF from the specs, in a simplified version, is plain text:
    %PDF-1.1
    %í??"
    1 0 obj
    << /Type /Catalog /Pages 2 0 R >>
    endobj
    2 0 obj
    << /Type /Pages /Kids [3 0 R] /Count 1 /MediaBox [0 0 300 144] >>
    endobj
    3 0 obj
    << /Type /Page /Parent 2 0 R /Resources << /Font << /F1 << /Type /Font /Subtype /Type1 /BaseFont /Times-Roman >> >> >> /Contents [ << /Length 105 >> stream
    BT /F1 18 Tf 0 0 Td (Hello world.) Tj ET
    endstream ] >>
    endobj
    xref
    0 4
    0000000000 65535 f
    0000000019 00000 n
    0000000078 00000 n
    0000000179 00000 n
    trailer
    << /Root 1 0 R /Size 4 >>
    startxref
    612
    %%EOF
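To see why only some PDFs pass, here is a small experiment, offered as a sketch. It relies on the perlfunc description of -T: the first block of the file is examined, a NUL byte marks it as binary, and otherwise it counts as text unless more than about a third of the characters look odd. The two sample files are made up for the demo and are not complete PDFs; write_temp is a hypothetical helper.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write raw bytes to a temporary file and return its name.
sub write_temp {
    my ($bytes) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    binmode $fh;
    print {$fh} $bytes;
    close $fh or die "close failed: $!";
    return $name;
}

# An uncompressed, ASCII-only PDF fragment.
my $ascii_pdf = write_temp(
    "%PDF-1.1\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n%%EOF\n"
);

# A PDF header followed by NUL and high-bit bytes, standing in for
# compressed streams or embedded images.
my $binary_pdf = write_temp("%PDF-1.4\n" . ("\x00\xFF\xFE\x80" x 64));

for my $file ($ascii_pdf, $binary_pdf) {
    my $verdict = (-T $file) ? '-T says text' : '-T says binary';
    print "$file: $verdict\n";
}
```

The ASCII-only file should be classified as text and the one with NUL bytes as binary, which matches the "about half" behaviour described in the question.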
Answer by Borodin for Perl File Test for Text -T and PDFs
You are using -T correctly: it is just a best guess rather than an absolute classification. It may help to know that PDF files carry a fourcc of %PDF, which you can check easily with a subroutine like this:
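    sub isPDF {
        # Open the file named by the first argument; return false on failure.
        open my $fh, '<', shift or return;
        # Read the first four bytes and compare them with the PDF signature.
        read $fh, my $fourcc, 4;
        return $fourcc eq '%PDF';
    }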
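One way the signature check might be combined with the tests from the question is sketched below. The names starts_with_pdf_signature and is_editable_text are hypothetical, and the policy (reject anything that starts with %PDF) is an assumption about what the editor should do.

```perl
use strict;
use warnings;

# True if the file starts with the 4-byte signature "%PDF".
sub starts_with_pdf_signature {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or return 0;
    my $got = read($fh, my $head, 4);
    close $fh;
    return 0 unless defined($got) && $got == 4;
    return $head eq '%PDF' ? 1 : 0;
}

# Hypothetical gate for the editor: a regular file that -T accepts
# and that does not carry the PDF signature.
sub is_editable_text {
    my ($path) = @_;
    return 0 unless defined($path) && -f $path && -T $path;
    return starts_with_pdf_signature($path) ? 0 : 1;
}
```

In the Tk code, the value returned by getOpenFile() would be passed to is_editable_text() in place of the original three-part condition.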
Answer by tuxuday for Perl File Test for Text -T and PDFs
As @Øyvind Skaar pointed out, try the 'file' command.
Answer by Chris Dolan for Perl File Test for Text -T and PDFs
Most PDFs have a few binary characters right after the %PDF on purpose to hint that it's not (entirely) a plain text file. The PDF spec even recommends it:
Note: If a PDF file contains binary data, as most do (see Section 3.1, "Lexical Conventions"), it is recommended that the header line be immediately followed by a comment line containing at least four binary characters, that is, characters whose codes are 128 or greater. This will ensure proper behavior of file transfer applications that inspect data near the beginning of a file to determine whether to treat the file's contents as text or as binary.
In @mugen kenichi's answer, you can see the %í??" that attempts to trigger this.
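A rough sketch of checking for that recommended comment line follows. The name has_binary_marker is hypothetical, and the code assumes the header and the comment are the first two newline-terminated lines, so PDFs that use a bare carriage return as the line ending would not be handled.

```perl
use strict;
use warnings;

# True if the line after the %PDF- header is a "%" comment holding at
# least four bytes with codes 128 or greater.
sub has_binary_marker {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or return 0;
    my $header = <$fh>;
    my $second = <$fh>;
    close $fh;

    return 0 unless defined($header) && $header =~ /\A%PDF-/;
    return 0 unless defined($second) && $second =~ /\A%/;

    # Count the high-bit bytes on the comment line.
    my $high_bytes = () = $second =~ /[\x80-\xFF]/g;
    return $high_bytes >= 4 ? 1 : 0;
}
```

A PDF without this marker and without other binary content is exactly the kind of file that -T is likely to accept as text.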