PDF Files

PDF files - A Quick Guide

Topics covered below include: The Basics, PDF File Formats, PDF Creation software, PDF Editing software, PDF Security and Copyright Protection, PDF Online Services, PDF Bookstores

PDF Files - the Basics for Electronic Publishing

Overview

The widely used Portable Document Format (PDF) for page-based documents was introduced by Adobe in 1993. It is based in large part on the PostScript page description language for modern printers, which had been developed in the late 1970s/early 1980s. In 2008 the specification for PDF was made available by Adobe as a royalty-free open standard and adopted by the International Organization for Standardization as ISO 32000-1 (often written PDF32000). A very good summary of the history and technical aspects of PDF files is provided on Wikipedia. Wikipedia also includes a very useful list of PDF Software.

However, the sections below provide more practical information based on our own experience of using PDFs and PDF tools over the past decade.

One of the most important features of PDF documents is that they are defined by a PAGE BASED model - this describes how individual pages in the document are made up, in terms of the text, the fonts used, graphical objects, interactive elements and possibly other features associated with the page. This PAGE BASED model means that when you look at a PDF page on-screen or on printed output, it should always look the same and as specified by the designer.

This is completely different from formats such as ePUB and HTML, which are not page based - they are effectively a linear stream of items, one after another, with limited "layout" elements (ePUB3 and HTML5 have improved on this of course, but they are still very much flexible, flowable formats). The two approaches have been designed independently, with a major aim of formats like ePUB being to allow the text to be the dominant element, re-sizable and re-flowable, ignoring the page concept and focusing on the size and orientation of the device on which it is viewed. ePUB, with its variants and versions, is the most widely used format for reading eBooks on mobile devices, including of course Amazon Kindle, Nook and other specialized ebook reader devices.

Page size and orientation

Some comments should be made at this stage about page size. Because PDF files are page-based, the underlying page size (or effective page size) matters - both for successful output to print devices and for reading on-screen. The great majority of documents that are converted into PDF format are based on A4 or US Letter page size in Portrait orientation. These have a total height of 11 inches (279mm) for US Letter and about 11.7 inches (297mm) for A4. If a computer screen displays at 100 pixels per inch the screen would need a resolution of at least 1100 pixels vertically to display the page, and even then it would be very difficult to read typical text at, say, 12 point size. And most computer displays, including all laptops, are less than 11 inches high so any entire page will be shrunk to fit the available screen space. Technical drawings can be even more difficult to decipher as they often include small fonts and fine lines. This does not present a serious problem for print output, as long as the printing is carried out with at least 300dpi (dots per inch) or preferably 600-2400+ dpi.

There are a number of solutions to this problem, the most obvious of which is zooming. Most PDF readers support immediate zooming to page width, and this increases the font size viewed dramatically with the result that the text is easy to read - however, a downside of this is that the page then has to be scrolled downwards and maybe sideways in order to read the entire text on a page. This in turn has some impact on the ease with which the user can read the document, particularly if it has many pages. For reference works, where small sections are referred to, this is not really a problem, but for fully reading and absorbing a long document, it is a limiting factor.

The alternative to zooming the page is for the source document to be arranged (designed to fit) on a smaller page size (e.g. A5 or one-half US Letter) and/or a landscape orientation used, and/or for the document to be created with larger font sizes.
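The page-size arithmetic above can be sketched in a few lines of Python. The page heights are the standard A4 and US Letter dimensions; the 100 pixels-per-inch density and the 768-pixel screen height are illustrative assumptions of ours, not measurements of any particular device, and the function names are made up for this example.

```python
# Page heights in inches (A4 is 297 mm high; US Letter is 11 inches high).
PAGE_HEIGHT_IN = {"A4": 297 / 25.4, "US Letter": 11.0}


def pixels_for_full_page(page: str, pixels_per_inch: float) -> int:
    """Vertical pixels needed to show the whole page at actual size."""
    return round(PAGE_HEIGHT_IN[page] * pixels_per_inch)


def apparent_point_size(page: str, screen_height_px: int,
                        pixels_per_inch: float, font_pt: float) -> float:
    """Apparent font size once the whole page is shrunk to fit the screen height."""
    full_height_px = PAGE_HEIGHT_IN[page] * pixels_per_inch
    scale = min(1.0, screen_height_px / full_height_px)  # never enlarge, only shrink
    return font_pt * scale


if __name__ == "__main__":
    for page in PAGE_HEIGHT_IN:
        needed = pixels_for_full_page(page, 100)
        shrunk = apparent_point_size(page, 768, 100, 12)
        print(f"{page}: {needed} px needed at 100 ppi; "
              f"12 pt text appears as about {shrunk:.1f} pt on a 768 px high screen")
```

Changing the screen height or the font size in the calls above gives a quick feel for when a smaller page size or a larger source font becomes worthwhile.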

Recommendations for successful web-based PDF viewing (from Mozilla.org)

The following recommendations are from the Mozilla organization, who are responsible for the Firefox web browser, combined with some of our own recommendations.

See Optimize a PDF on Adobe's website for more information. There are further improvement techniques that we can suggest:

  • Avoid using high resolution images - 150 dpi resolution for scanned images should be enough for screens
  • Try to use JPEG encoding for color images/photos in RGB colorspace when possible
  • Avoid using expensive compositions/effects such as transitions/masking - flatten transparency
  • Avoid using PDF generators (or do not create content) that produce ineffective PDF output (e.g. LibreOffice and several other PDF creators produce lots of tiny images for vector elements/pictures that they do not understand)
  • If there is such a setting, use web-optimized PDF output / linearization
  • Fix, or avoid producing, corrupted PDFs that do not conform to the PDF32000 (ISO 32000) specification
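Whether a PDF has been saved in web-optimized (linearized) form can be checked roughly from its first bytes. The sketch below is a heuristic only, assuming the convention in the PDF specification that a linearized file starts with a parameter dictionary containing a /Linearized key within the first 1024 bytes. The function name and the sample byte strings are made up for illustration, and the samples are fragments rather than complete PDF files.

```python
def looks_linearized(data: bytes) -> bool:
    """Heuristic: a linearized PDF carries a /Linearized key near the start."""
    return data.startswith(b"%PDF-") and b"/Linearized" in data[:1024]


# Illustrative fragments only: these are not complete, valid PDF files.
web_optimized = b"%PDF-1.6\n1 0 obj\n<< /Linearized 1 /L 12345 >>\nendobj\n"
ordinary = b"%PDF-1.6\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"

print(looks_linearized(web_optimized))
print(looks_linearized(ordinary))
```

For a real file it should be enough to read the first 1024 bytes in binary mode and pass them to the function.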

Another common issue with PDF creation is the use of implicit rather than explicit URLs. Suppose you include www.adobe.com as text in your source document and then output the file to PDF. Some PDF readers, including the Adobe Reader, will "guess" that you meant this to be a web address and will act on it accordingly. However, the item will not be highlighted as a link (which is the case for web browsers also, as can be seen from www.adobe.com not showing up as a link) and for many PDF viewers on different technology platforms the result will be no action at all. The solution is to make the link explicit rather than implicit (when you want it to be a link instead of just text). You can do this by selecting the text in the source file and then telling the editor to add a hyperlink at that point. This is like ensuring that a link in text like "Click here" actually specifies where the resulting click should take you to.
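As a rough aid, implicit web addresses can be found in the source text before the PDF is generated, so that each one can be given an explicit hyperlink in the editor. The following is a minimal sketch: the pattern only looks for bare "www." addresses, the function names are our own, and the choice of https:// as the scheme is an assumption that may not suit every site.

```python
import re

# Matches bare addresses such as "www.adobe.com" that are NOT already part of
# an explicit URL (i.e. not preceded by "/", a word character or a dot).
IMPLICIT_URL = re.compile(r"(?<![/\w.])www\.[\w-]+(?:\.[\w-]+)+", re.IGNORECASE)


def find_implicit_urls(text: str) -> list[str]:
    """Return the bare web addresses that a reader would have to guess at."""
    return IMPLICIT_URL.findall(text)


def make_explicit(address: str) -> str:
    """Build the explicit target that the hyperlink should carry (https assumed)."""
    return "https://" + address


sample = "Visit www.adobe.com or https://www.gutenberg.org today."
for address in find_implicit_urls(sample):
    print(address, "->", make_explicit(address))
```

Only the first address in the sample is reported, because the second already carries an explicit scheme.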

Video and Audio media files can be embedded in some Adobe-specific PDFs, but should not be embedded for general use as these will not work across platforms and readers and the resulting PDF files are generally huge and unsuitable for downloading. For video and audio files we recommend that these be created in MP3 or MP4 format and then placed on an in-network server for user access. Then within the PDF place a link to these files so they become linked rather than embedded. The resulting PDF file will work on all platforms and situations, and will remain small, even where the media files are large.

PDF file formats

Note - this section is based on materials published by Adobe and others - links are provided to original technical reference material below. When creating a PDF file many software packages offer you the option of choosing a particular version of PDF, and for most purposes choosing PDF version 1.5, 1.6 or 1.7 is best. These versions will be readable by all modern PDF readers, not just the Adobe Reader. Many of the advanced features of Adobe PDFs, such as JavaScript, audio and video embedding, and 3D model support, are very specific to Adobe, so will not work on most other readers. Even form-filling and markup, which were introduced into the standards many years ago, are not supported by many PDF readers. This applies not just to offline readers on desktops, laptops and mobile and tablet devices, but also to almost all online PDF viewers that are supported by the main web browsers (Chrome, Firefox, Safari etc).

Basic structure of a PDF file

(source: Adobe, 2004)

The general structure of a PDF file is composed of the following code components: header, body, cross-reference (xref) table, and trailer, as shown in figure 1.

Figure 1. Basic structure of a PDF file


The header contains just one line that identifies the version of PDF. For example, %PDF-1.4 is the first line of the testfonts.pdf file (which we include on our Samples page in our proprietary secured format). If you add the two digits of the version number, e.g. 1.4 -> 1+4, you get 5, which is the version of Adobe Reader needed to view a document in that version of PDF. So version 1.6, which is probably the last overall "standard" version that is most widely used, requires Adobe Reader V7 or later (or other PDF readers that handle PDF version 1.6).

The trailer contains pointers to the xref table and to key objects contained in the trailer dictionary. It ends with %%EOF to identify end of file.

The xref table contains pointers to all the objects included in the PDF file. It identifies how many objects are in the table and, for each object, where it begins (the byte offset), its generation number, and whether it is in use or free.

The body contains all the object information: fonts, images, words, bookmarks, form fields, and so on.
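The four components can be seen by assembling a tiny file by hand and then reading the header and trailer back. The sketch below is our own illustration, not code from Adobe: the helper names are hypothetical, the xref entry layout (10-digit offset, generation number, in-use flag) follows the PDF specification, and the sample is meant to show the layout rather than to be a file that every reader will open.

```python
import re


def build_minimal_pdf() -> bytes:
    """Assemble header, body, xref table and trailer for a tiny sample file."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")               # header: a single version line
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))                 # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"   # body
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)  # xref: first object number and entry count
    out += b"0000000000 65535 f \n"              # entry 0 is always a free entry
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset      # offset, generation number, in-use flag
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset   # pointer to the xref table, then %%EOF
    return bytes(out)


def pdf_version(data: bytes) -> str:
    """Read the version from the %PDF-x.y header line."""
    match = re.match(rb"%PDF-(\d+)\.(\d+)", data)
    if match is None:
        raise ValueError("missing %PDF- header")
    return match.group(1).decode() + "." + match.group(2).decode()


def adobe_reader_needed(version: str) -> int:
    """Rule of thumb from the text for PDF 1.x: 1.4 -> 1 + 4 -> Reader 5."""
    major, minor = version.split(".")
    return int(major) + int(minor)


def xref_offset(data: bytes) -> int:
    """Follow the trailer's startxref pointer, found just before %%EOF."""
    match = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data[-1024:])
    if match is None:
        raise ValueError("missing startxref/%%EOF trailer")
    return int(match.group(1))


sample = build_minimal_pdf()
version = pdf_version(sample)
print(version, adobe_reader_needed(version))
start = xref_offset(sample)
print(sample[start:start + 4])   # the bytes the trailer points at
```

Reading the trailer first and then jumping to the xref table is how a reader can locate any object without scanning the whole file.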

The following links provide access to the technical specifications - these are large files, often 30MB+.

Save and Save As

When you perform a Save operation on a PDF file, the new, incremental information is appended to the original structure (see figure 2); that is, a new body, xref table, and trailer are added to the original PDF file.
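The effect of an incremental Save can be sketched as follows. This is a simplified illustration under our own assumptions: the sample is a skeleton rather than a complete document, the helper names are hypothetical, and the update replaces an existing object so the /Size value does not change.

```python
import re

# A skeleton "original" file: header, one object, xref table and trailer.
ORIGINAL = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    b"startxref\n45\n%%EOF\n"
)


def last_xref_offset(data: bytes) -> int:
    """Read the startxref pointer that precedes the final %%EOF."""
    match = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data)
    if match is None:
        raise ValueError("missing startxref/%%EOF trailer")
    return int(match.group(1))


def incremental_save(data: bytes, obj_number: int, new_body: bytes) -> bytes:
    """Append a new body, xref table and trailer; the original bytes are untouched."""
    out = bytearray(data)
    obj_offset = len(out)
    out += b"%d 0 obj\n" % obj_number + new_body + b"\nendobj\n"        # new body
    xref_offset = len(out)
    out += b"xref\n%d 1\n%010d 00000 n \n" % (obj_number, obj_offset)   # new xref table
    # /Prev points back to the previous xref table; /Size 2 assumes no new object numbers.
    out += b"trailer\n<< /Size 2 /Root 1 0 R /Prev %d >>\n" % last_xref_offset(data)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset                    # new trailer
    return bytes(out)


updated = incremental_save(ORIGINAL, 1, b"<< /Type /Catalog /Lang (en) >>")
print(updated.startswith(ORIGINAL))   # the original structure is still there
print(updated.count(b"%%EOF"))        # one end-of-file marker per save
```

Because each save only appends, the file grows with every update and earlier versions of changed objects remain in the file.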

Figure 2. Structure of a PDF file after updates


PDF Creation software

In the past creating PDF files meant purchasing Adobe Acrobat software from Adobe Inc, and even today, Acrobat is one of the best and most powerful software packages for the creation and amendment of PDF files. However, there are many other ways of generating a PDF, typically as an Export option within major document creation software products. This applies to the full set of current MS Office and OpenOffice applications - whether Word, Excel, PowerPoint etc or their equivalents in OpenOffice. The OpenOffice export to PDF option is very fast and produces good quality PDF files, but has almost no options for specifying attributes of the generated file. MS Office Export to PDF does include a range of options, with the most relevant being those that create structured PDF files and files that are optimized for screen or print. MS Word files that include Heading styles can be set to create a Contents tree automatically (also known as an Outline or Bookmark tree) which is very important for fast navigation of larger documents, particularly when viewed on mobile devices. All modern desktop publishing software products also generate PDF files, with preset options for print production and in some cases, for screen viewing. Amongst the best and most widely used of these are Adobe's InDesign software for PCs and Macs and QuarkXPress for Macs. Mac computers will create PDF files from almost all appropriate applications, including the basic Pages and office-related facilities included as standard with OS X. Similar facilities exist within Linux.

In addition to the above options, there are many "print drivers" which will create a simple PDF as output from any desktop application under Windows, just by printing the material to a specially installed printer ... in this case, a non-physical print device. The result is a PDF with very little functionality, but usable for many basic applications.

Recommended links:

  • Adobe - for Acrobat and InDesign software
  • A-PDF - for software from A-PDF (affordable PDF Tools), including an excellent watermarking tool
  • jPDFBookmarks - free PDF bookmarking tool
  • CutePDF - for software from CutePDF (PDF print driver/writer and editor)
  • PDF Creator Plus - from Peernet
  • Setasign - PHP software for creating and augmenting web-based PDFs

PDF Editing software

As explained earlier PDF files have a quite complex structure, and there are many variants in terms of the standards applied and the way in which these standards have been implemented. In fact, some would argue that there are almost no real standards because there is so much variation in implementations. The result is that the structure of any particular PDF file can be extremely complicated, and thus very difficult to amend (edit) post creation. It is, of course, possible to perform a range of functions which come under the general heading of editing, i.e. are not simply the amendment of text on a page. These features include changing the content of specific pages; changing the use of particular fonts; cleaning/optimizing files to remove duplications and poor structure; splitting and combining PDF files; extracting and inserting pages; saving the content in alternative formats (other PDF standard variants or completely different formats, such as Rich Text (RTF)); and more. In the case of Adobe Acrobat, which is the most widely used PDF Editor, the standard editing tools are arranged into groups of functions:

  • Content editing - includes adding and editing text and images, plus adding hyperlinks and bookmarks
  • Page manipulation - includes rotation, deletion, splitting, watermarking, style changes etc
  • Forms and button management - includes adding fields for text/data input and related functions
  • Text recognition, i.e. OCR functionality, mainly used for conversion of scanned-in files to text and numbers where possible
  • Document processing - which includes a range of functions, from aspects of page layout to page numbering and auto-identifying web content and URLs (converting implicit URLs into explicit URLs)
  • Additional tools are also provided, either built-in or downloadable from Adobe's website

Other PDF Editors provide similar functionality, with their own take on the most important aspect of their usage - for example, the Infix editor from Iceni is much closer to a Word Processor style of editor than Acrobat. Likewise, Foxit's PDF Editor software provides a very wide range of functionality, similar to the features described above, at a very competitive price. And there are innumerable online PDF conversion and editing websites, most of which provide basic functionality as a free service with more advanced features on a subscription basis. Software providers include:

  • Adobe - for Acrobat and InDesign software
  • Iceni - PDF Editor provider
  • Foxit - PDF Editor and Reader provider
  • PDF Architect - Tailorable PDF Editor provider

PDF Security and Copyright Protection

PDF Security is a complex topic, with three main strands: (i) Authentication; (ii) Content protection; and (iii) Digital Rights Management. In this section we discuss each briefly, but there is a vast amount of information available on all these topics. Copyright protection of PDFs requires use of a Digital Rights Management (DRM) service.

PDF Authentication

Authentication is the process of determining whether a document is from the person or organization it claims to be from and/or is correctly signed by them. The basic ideas are summarized by Adobe as follows:

PDF supports two kinds of digital signatures: approval signatures and certification signatures. Any number of approval signatures may be applied to a PDF document but only one certifying signature may be applied and it must be the first digital signature. Approval signatures are used in the same manner as the ink on paper signatures we are all familiar with. Certification signatures are considered a part of creating the PDF file so only occur once at the beginning.

The screenshot below provides an example of the use of Digital Signatures with the Adobe software - the provision of such information regarding signatures is not implemented in all PDF readers, so for such files use of Adobe's reader is recommended. Also notice that the first signature shown here is recorded as being by DocuSign, i.e. via a third party document signing service.

Digital Signatures

Digital certification is slightly different from applying signatures - it involves use of an independent certification authority. Adobe approves the following certification authorities: Entrust, GlobalSign, OpenTrust and Symantec (Verisign). The screenshot below illustrates the prompt provided when using the signing tool in Adobe's Certificate Signing option:

Certification

Content Protection

Content protection is the most familiar option for users of PDFs. It provides for two forms of document content protection. The first is to apply a password restriction such that when a user attempts to open the document a password is requested before the content is displayed. One reason for providing such a facility in the early days was to try and limit access to PDF files on PCs that were shared or left unattended. In large measure this facility is now redundant and is not recommended for use.

The principal form of content protection provided within current standards and Adobe's implementations involves encryption of the PDF file using a user-supplied password. The password, known only to the person securing the file, is entered and selected features of PDF are then made unavailable to a person to whom the file is sent. A sample screenshot of the Adobe Acrobat security settings screen is shown below. The upper entry is for the Open password protection described above, whilst the lower section specifies the security password and permission settings required. As can be seen the default setting is to not allow copying of the text or printing. HOWEVER, even if you set these controls they may be ignored if the resulting PDF file is opened with a non-Adobe PDF reader that ignores such settings (see our separate article on this question).

The other big issue with applying such security is trying to determine whether the file can be readily decrypted. The brief answer to this is "yes", unless the encryption password/key is long and complicated and the encryption level is at least 128-bit AES or 256-bit AES - for more details on decryption of secured PDF files see the website of the Russian software house, Elcomsoft. The 256-bit encryption level for Adobe PDFs is not available for standard PDFs (it requires Adobe Reader 9.0 or later), so should only be used if the target PDF readers are ALL the Adobe Reader.
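Why password length and complexity matter can be shown with some simple arithmetic on the number of candidate passwords an attacker would have to try. In the sketch below the guessing rate is a made-up figure chosen purely for illustration; it is not a benchmark of any real tool, and real rates vary enormously with the encryption level and the hardware used.

```python
def keyspace(alphabet_size: int, length: int) -> int:
    """Number of distinct passwords of exactly this length."""
    return alphabet_size ** length


def worst_case_days(alphabet_size: int, length: int, guesses_per_second: float) -> float:
    """Days needed to try every candidate at the given guessing rate."""
    return keyspace(alphabet_size, length) / guesses_per_second / 86_400


ASSUMED_RATE = 1_000_000  # guesses per second: hypothetical, for illustration only

for label, alphabet, length in [
    ("6 lowercase letters", 26, 6),
    ("12 mixed-case letters and digits", 62, 12),
]:
    days = worst_case_days(alphabet, length, ASSUMED_RATE)
    print(f"{label}: {keyspace(alphabet, length):,} candidates, "
          f"about {days:,.2f} days at the assumed rate")
```

Each extra character multiplies the work by the size of the alphabet, which is why a long, mixed password changes the picture far more than a small increase in complexity.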

Content protection is possible, including print, copy, forward, save and date-based protection, using non-Adobe solutions. In most cases these facilities require alternative PDF readers - these are discussed below in the section covering Digital Rights Management.

Encryption

Digital Rights Management

Digital Rights Management (or DRM for short) is the term applied to the protection of digital assets using a centralized rights management service. It uses the combination of several distinct elements to provide the strongest possible protection of PDF documents, i.e. protection against copyright theft, amendment, forwarding files and much more. DRM services not only protect documents at the point of viewing, but also provide facilities to track access and in some instances, withdraw access permission from the end user.

DRM systems fall broadly into two main classes: (i) Hardware-based solutions, which rely on identification of pre-registered hardware, typically an eBook reader (e.g. a Kindle) or controlled generic device (e.g. an Apple iPad), in order to verify the end user, their access and permissioning; and (ii) Software-based solutions, which apply across technology platforms, i.e. that are not based on proprietary hardware but rely on the exchange of information between the user and user's device, and the central DRM service, to uniquely identify the target recipient of the secured PDF file. Note that for PDF files, unlike ePUB or similar files, the hardware-based ebook vendors like Amazon do not offer a DRM service, so use of third party software and service solutions is required for PDF DRM. This includes Adobe of course (at a high cost) and a number of other providers. A number of these providers offer cross-platform solutions, whilst others (like FileOpen, an Adobe Partner) offer solutions that are both cross platform and cross document type.

Because hardware-based DRM services do not support PDF files, software solutions are required. In general this involves using a special PDF reader, e.g. Digital Editions from Adobe or Javelin from Drumlin Security, to open a specially encrypted PDF file. Foxit-based PDF readers with DRM security are available from Foxit and from Locklizard.

Software DRM solutions can also be separated into two distinct types - those that are based on use of some form of code string for authorization of access to a specific document, and those based on license files and/or online access by pre-registered users. In the former case there is no requirement for users to be registered, so no user management system is imposed on the implementation and management of the service. In the latter case all users must be registered with the central DRM service before they can be enabled to view specific documents. This latter requirement has the advantage that it provides a high level of access control, with the option to disable a document and/or user under specific circumstances. However, it has the disadvantage that the entire system has to be managed, which can impose a substantial overhead on organizations. For this reason it is best applied in cases where PDF documents are distributed intra-corporately, although it can be applied for extra-corporate PDF distribution when applied carefully (e.g. for well-defined closed user groups). It is not really suitable for eBook sales and similar low overhead/low margin applications, nor for smaller organizations where the cost of managing such a service, and possibly providing DRM service integration, can be high.
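The difference between the two types can be illustrated with a toy model. This is a sketch under our own assumptions and is NOT the scheme used by any provider named in this guide: the secret, the registry, the document identifier and all function names are hypothetical, and a real service involves encryption of the document itself and much more.

```python
import hashlib
import hmac

# --- Code-based: anyone holding a valid code for the document may open it ---
SERVICE_SECRET = b"example-secret-held-by-the-DRM-service"   # hypothetical value


def issue_code(document_id: str) -> str:
    """Derive an authorization code tied to one specific document."""
    digest = hmac.new(SERVICE_SECRET, document_id.encode(), hashlib.sha256).hexdigest()
    return digest[:16]


def code_grants_access(document_id: str, code: str) -> bool:
    """No user registry is consulted: only the code and the document matter."""
    return hmac.compare_digest(issue_code(document_id), code)


# --- License-based: users are pre-registered and can have access withdrawn ---
REGISTERED_USERS = {"alice@example.com": {"report-2015"}}    # hypothetical registry


def licence_grants_access(user: str, document_id: str) -> bool:
    return document_id in REGISTERED_USERS.get(user, set())


def revoke(user: str, document_id: str) -> None:
    """Central control: disable one document for one user."""
    REGISTERED_USERS.get(user, set()).discard(document_id)


code = issue_code("report-2015")
print(code_grants_access("report-2015", code))
print(licence_grants_access("alice@example.com", "report-2015"))
revoke("alice@example.com", "report-2015")
print(licence_grants_access("alice@example.com", "report-2015"))
```

The code-based half needs no user records at all, which is where its low management overhead comes from; the license-based half needs the registry to be maintained, but in return allows access to be withdrawn.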

Some PDF DRM providers, including Drumlin Security, offer both options, i.e. code-based and license-based. The standard service is code-based; this is inexpensive and very quick and simple to implement, and is ideally suited to small-medium sized organizations and to ecommerce applications. For larger organizations, with more complex requirements, the license-based approach may be more suitable.

Before ending this section we should note that a rather different approach to PDF security is possible: delivery of content via a web browser, with or without user access controls (i.e. user login with tracking of this activity). With this model users access the PDF via a standard web browser (typically an HTML5 compliant browser, which most are nowadays) and view the pages online rather than offline. The advantages of this approach are that (i) no special software download and install is required; (ii) no document distribution is required; and (iii) the service is highly scalable with minimal management support required. The disadvantages are that the document must be read online, which means continuous access online is needed; the quality of display, speed of display, and overall functionality are typically not as good as offline usage; and security is much lower because the software being used to view the files is just a web browser over which the service provider has no control. Solutions of this type can be based on PDF display, Flash-based display (not recommended as Flash is not supported on many devices), HTML5 display (can be fast and very good quality if statically generated, slower if dynamically generated), or pure HTML display (effective but poorer quality).

Selected providers include:

  • Drumlin Security - for our own Drumlin Publisher and Javelin Reader software and DRM services (offline and online)
  • Adobe - for Adobe Digital Editions reader and Adobe DRM services (LiveCycle)
  • FileOpen - for Adobe Reader based solutions and a range of associated DRM services
  • Foxit - Foxit Reader based PDF Rights Management System
  • Locklizard - for Foxit Reader based solution and a range of associated DRM services

PDF Online Conversion services

There are many online PDF conversion and enhancement services, some built using the Setasign php software mentioned earlier. Examples include:

  • VeryPDF - PDF conversion and watermarking
  • Small PDF - Conversion to/from PDF, Merge/Split, Basic security (lock/unlock)

PDF Bookstores

Although not immediately apparent, many books are available electronically at no cost, i.e. free to download. Project Gutenberg (https://www.gutenberg.org/wiki/Main_Page/) is one such source, although most of the 50,000+ titles it carries are out-of-print and in ePUB or Kindle formats.

Copyright is an important issue with "free" books - our article Catching the eBook Pirates addresses this question in more detail.

Sample news articles