
Friday, April 15, 2016

Get the number of pages in a PDF document



This question is for referencing and comparing. The solution is the accepted answer below.

I have spent many hours searching for a fast and easy, but above all accurate, way to get the number of pages in a PDF document. I work for a graphic printing and reproduction company that handles a lot of PDFs, so the number of pages in a document must be known precisely before it is processed. The PDF documents come from many different clients, so they aren't all generated with the same application and don't all use the same compression method.

Here are some of the answers I found insufficient or simply NOT working:

Using Imagick (a PHP extension)

Imagick requires a lot of installation work, Apache needs to be restarted, and when I finally had it working, it took amazingly long to process (2-3 minutes per document) and it always returned 1 page for every document (I haven't seen a working copy of Imagick so far), so I threw it away. That was with both the getNumberImages() and identifyImage() methods.

Using FPDI (a PHP library)

FPDI is easy to use and install (just extract files and call a PHP script), BUT many of the compression techniques are not supported by FPDI. It then returns an error:

FPDF error: This document (test_1.pdf) probably uses a compression technique which is not supported by the free parser shipped with FPDI.

Opening a stream and searching with a regular expression

This opens the PDF file as a stream and searches for a string that contains the page count or something similar.

    $f = "test1.pdf";
    $stream = fopen($f, "r");
    if (!$stream)
        return 0;

    $content = fread($stream, filesize($f));
    fclose($stream);
    if (!$content)
        return 0;

    $count = 0;
    // Regular expressions found by Googling (all linked to SO answers):
    $regex  = "/\/Count\s+(\d+)/";
    $regex2 = "/\/Page\W*(\d+)/";
    $regex3 = "/\/N\s+(\d+)/";

    // $matches[1] holds the captured numbers
    if (preg_match_all($regex, $content, $matches))
        $count = max(array_map("intval", $matches[1]));

    return $count;

  • /\/Count\s+(\d+)/ (looks for /Count) doesn't work because only a few documents have the parameter /Count inside, so most of the time it doesn't return anything.
  • /\/Page\W*(\d+)/ (looks for /Page) doesn't get the number of pages; the match mostly contains some other data.
  • /\/N\s+(\d+)/ (looks for /N) doesn't work either, as a document can contain multiple /N values, most of which, if not all, are not the page count.
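Even when /Count is present in the raw bytes, it is not unique: the key appears on every node of the page tree and also in the outline (bookmark) dictionary, and in files that use object streams it may not be visible in the raw bytes at all. The sketch below illustrates the ambiguity. The helper name findCountValues and the PDF fragment are made up for this illustration and are not taken from a real file.

```php
<?php
// Hypothetical helper: collect every integer that follows a /Count key
// in raw (uncompressed) PDF bytes.
function findCountValues(string $content): array
{
    preg_match_all('/\/Count\s+(\d+)/', $content, $matches);
    return array_map('intval', $matches[1]);
}

// Synthetic fragment (an assumption for illustration): a page tree root
// with 13 pages, an intermediate page tree node with 5 pages, and an
// outline dictionary with 27 visible entries.
$fragment = "1 0 obj << /Type /Pages /Kids [2 0 R 3 0 R] /Count 13 >> endobj\n"
          . "2 0 obj << /Type /Pages /Parent 1 0 R /Count 5 >> endobj\n"
          . "9 0 obj << /Type /Outlines /First 10 0 R /Count 27 >> endobj\n";

$values = findCountValues($fragment);
print_r($values);         // all three candidates are found
echo max($values), "\n";  // the largest value here is the outline count, not the page count
```

Taking the maximum, as the snippet in the question does, picks the outline count in this fragment, so the result is wrong even though a /Count match exists.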

So, what does work reliably and accurately?

See the answer below

Answer by Richard de Wit for Get the number of pages in a PDF document


A simple command line executable called: pdfinfo.

It is downloadable for Linux and Windows. You download a compressed file containing several little PDF-related programs. Extract it somewhere.

One of those files is pdfinfo (or pdfinfo.exe for Windows). An example of data returned by running it on a PDF document:

    Title:          test1.pdf
    Author:         John Smith
    Creator:        PScript5.dll Version 5.2.2
    Producer:       Acrobat Distiller 9.2.0 (Windows)
    CreationDate:   01/09/13 19:46:57
    ModDate:        01/09/13 19:46:57
    Tagged:         yes
    Form:           none
    Pages:          13    <-- This is what we need
    Encrypted:      no
    Page size:      2384 x 3370 pts (A0)
    File size:      17569259 bytes
    Optimized:      yes
    PDF version:    1.6

I haven't seen a PDF document for which it returned a wrong page count (yet). It is also really fast: even with big documents of 200+ MB, the response time is just a few seconds or less.

There is an easy way of extracting the pagecount from the output, here in PHP:

    // Make a function for convenience
    function getPDFPages($document)
    {
        // Pick the line that matches your system
        $cmd = "/path/to/pdfinfo";              // Linux
        // $cmd = "C:\\path\\to\\pdfinfo.exe";  // Windows

        // Run pdfinfo and collect its output, one line per array element.
        // escapeshellarg() quotes the file name, so spaces are handled.
        exec($cmd . " " . escapeshellarg($document), $output);

        // Iterate through lines
        $pagecount = 0;
        foreach ($output as $op)
        {
            // Extract the number
            if (preg_match("/Pages:\s*(\d+)/i", $op, $matches) === 1)
            {
                $pagecount = intval($matches[1]);
                break;
            }
        }

        return $pagecount;
    }

    // Use the function
    echo getPDFPages("test 1.pdf");  // Output: 13
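If more than the page count is needed, the same output can be turned into a lookup table. The following is a minimal sketch, assuming every line has the "Key: value" shape shown in the sample output above; the helper name parsePdfinfoOutput is made up for this example, and the input lines are copied from that sample rather than produced by a real pdfinfo run.

```php
<?php
// Hypothetical helper: turn pdfinfo's "Key: value" lines into an array.
function parsePdfinfoOutput(array $lines): array
{
    $info = [];
    foreach ($lines as $line) {
        // Split on the first colon only, since values such as dates contain colons.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            $info[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $info;
}

// Sample lines copied from the output shown earlier in this answer.
$sample = [
    "Title:          test1.pdf",
    "CreationDate:   01/09/13 19:46:57",
    "Pages:          13",
    "Page size:      2384 x 3370 pts (A0)",
    "File size:      17569259 bytes",
];

$info  = parsePdfinfoOutput($sample);
$pages = isset($info['Pages']) ? (int) $info['Pages'] : 0;
echo $pages, "\n";
```

In real use, the $output array filled by exec() would be passed in instead of $sample. Falling back to 0 when the Pages key is missing mirrors the behaviour of getPDFPages().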

Of course this command line tool can be used in other languages that can parse output from an external program, but I use it in PHP.

I know it's not pure PHP, but external programs are way better at PDF handling (as seen in the question).

I hope this can help people, because I have spent a whole lot of time trying to find the solution to this and I have seen a lot of questions about PDF pagecount in which I didn't find the answer I was looking for. That's why I made this question and answered it myself.

Answer by Muad'Dib for Get the number of pages in a PDF document


If you can't install any additional packages, you can use this simple one-liner:

    foundPages=$(strings < "$PDF_FILE" | sed -n 's|.*Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1)

Answer by Feiming Chen for Get the number of pages in a PDF document


Here is an R function that reports the number of pages in a PDF file by using the pdfinfo command.

    pdf.file.page.number <- function(fname) {
        # shQuote() protects file names that contain spaces
        a <- pipe(paste("pdfinfo", shQuote(fname), "| grep Pages | cut -d: -f2"))
        page.number <- as.numeric(readLines(a))
        close(a)
        page.number
    }
    if (FALSE) {
        pdf.file.page.number("a.pdf")
    }

Answer by commander for Get the number of pages in a PDF document


Here is a Windows command script using Ghostscript that reports the number of pages in a PDF file:

    @echo off
    echo.
    rem
    rem this file: getlastpagenumber.cmd
    rem version 0.1 from commander 2015-11-03
    rem need Ghostscript e.g. download and install from http://www.ghostscript.com/download/
    rem Install path "C:\prg\ghostscript" for using the script without changes \\ and have less problems with UAC
    rem

    :vars
      set __gs__="C:\prg\ghostscript\bin\gswin64c.exe"
      set __lastpagenumber__=1
      set __pdffile__="%~1"
      set __pdffilename__="%~n1"
      set __datetime__=%date%%time%
      set __datetime__=%__datetime__:.=%
      set __datetime__=%__datetime__::=%
      set __datetime__=%__datetime__:,=%
      set __datetime__=%__datetime__:/=%
      set __datetime__=%__datetime__: =%
      set __tmpfile__="%tmp%\%~n0_%__datetime__%.tmp"

    :check
      if %__pdffile__%=="" goto error1
      if not exist %__pdffile__% goto error2
      if not exist %__gs__% goto error3

    :main
      %__gs__% -dBATCH -dFirstPage=9999999 -dQUIET -dNODISPLAY -dNOPAUSE  -sstdout=%__tmpfile__%  %__pdffile__%
      rem read the page number from the temp file Ghostscript just wrote
      FOR /F " tokens=2,3* usebackq delims=:" %%A IN (`findstr /i "number" %__tmpfile__%`) DO set __lastpagenumber__=%%A
      set __lastpagenumber__=%__lastpagenumber__: =%
      if exist %__tmpfile__% del %__tmpfile__%

    :output
      echo The PDF-File: %__pdffilename__% contains %__lastpagenumber__% pages
      goto end

    :error1
      echo no pdf file selected
      echo usage: %~n0 PDFFILE
      goto end

    :error2
      echo no pdf file found
      echo usage: %~n0 PDFFILE
      goto end

    :error3
      echo.can not find the ghostscript bin file
      echo.   %__gs__%
      echo.please download it from:
      echo.   http://www.ghostscript.com/download/
      echo.and install to "C:\prg\ghostscript"
      goto end

    :end
      exit /b

Answer by Kuldeep Dangi for Get the number of pages in a PDF document


The simplest of all is to use ImageMagick.

Here is some sample code:

    $image = new Imagick();
    $image->pingImage('myPdfFile.pdf');
    echo $image->getNumberImages();

Otherwise, you can also use PDF libraries for PHP such as MPDF or TCPDF.

