PDF Files: SEO and Accessibility
1. PDF and Search Engines Compatibility
Crawl and index: Search engines (hereafter SE), and Google specifically, can crawl and index PDF files. In the absence of any other directives (see below for options), Google will crawl and index any PDF linked from a page that its crawler encounters, subject to its file size crawl limitations (see below).
During the crawl, Google will fully index all text in the PDF, including headings markup, but not images or text in images (OCR text is considered regular text and will be indexed fully).
Google will also index links from the PDF text as it would in HTML, including the transfer of ranking (or “juice”) through links.
Search results display: as far as visual presentation in the search results is concerned, the only difference from HTML pages is that Google clearly marks the file format.
Screenshot: Google search results for a PDF document
Search results locations: as far as position in the search results is concerned, PDFs can and do fully compete with HTML pages. Although Google has not published the details, the parameters for grading PDF files are generally understood to differ from those for HTML files, mostly because PDFs tend to contain far more text (and therefore reach more keywords) than an average website’s HTML page. The difference in grading exists to allow a fair comparison between HTML and PDF versions of content. As a result, PDFs on sites can successfully compete with HTML pages and rank very high, even in the first places of the organic search results.
1.2. Content- in PDF or HTML?
When placing content on a site, a choice between content formats (HTML, PDF, Word, Flash etc.) often arises.
In terms of SEO, this choice implies a strategic decision: where should we place the strength (authority) of the content in the eyes of search engines, on the site pages or in an external file (PDF)?
While there are situations requiring the use of both formats, and although PDFs are generally well indexed and accepted in search engines, choosing a PDF has several downsides.
First of all, it’s important to realize that when a user moves directly from search results to a PDF (i.e. the PDF is the user’s “landing page”), the user is not really on the website as far as the user experience goes: the user is not exposed to the website design, the logo, the navigation bars, the header and footer etc.
Apart from the user experience, the total absence of the site’s user interface (UI) drastically affects the user’s ability to move to other content on the site: the user’s ability to view more pages and perform more actions is severely limited.
Similarly, and for the same reason, so is our ability to drive the user into specific funnels and content we want the user to see, or to drive the user to perform any actions.
Finally, while links in PDFs are indexed, it is not possible to control the transfer of authority through them as in HTML (it’s not possible to apply “no index” or “no follow” markup to links inside PDFs).
For these reasons and others, when there are no special conditions dictating the use of either PDF or HTML, it is always preferable to place content in HTML and not in PDF.
However, as mentioned, there are often situations calling for the use of PDFs, for example user guides or forms that need to be downloaded by the user. It is important to realize that even in such situations, using PDFs does not necessarily mean that we must give up the strategic choice to place the content authority in HTML pages.
For example, it is possible to place all the content in HTML and at the same time offer a downloadable PDF copy, while using techniques that direct search engines to place all the authority of the content in the HTML version only (see below). This solution is well suited to relatively short content.
In cases where the content is long, it is possible to use a focused, keyword-driven synopsis in the HTML pages while offering the full content as a downloadable PDF, again using techniques that direct SE to place all the authority in the HTML version only.
1.3. Preferring PDF content- recommendations
In the rare cases where we do choose to place the authority in PDF files, it is advised to observe the following considerations:
Allowing PDFs to be indexed: no special action is needed in order to allow indexing. As soon as a crawler encounters a link to a PDF, it will attempt to crawl and index it. However, for several technical reasons, crawling and indexing PDFs takes SE longer than HTML does (usually on the scale of hours to days, but sometimes up to a month longer). Therefore, there is no cause for alarm if, upon first crawling, an HTML page gets indexed but the PDFs linked within it are not yet indexed.
Encouraging and speeding up indexing: it is recommended to list the address of each PDF in the website’s sitemap file, as for any HTML page, in order to hasten the indexing.
If there is an urgent need for speedy indexing, or if the PDF has still not been indexed after a long period of time (over a month), and assuming that SE have full access to the file for indexing purposes, it is possible to use Google Webmaster Tools to submit the PDF for crawling (“Fetch as Google”) and, after the crawl, to submit the results for indexing.
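The sitemap recommendation above can be illustrated with a minimal Python sketch. It assumes the standard sitemaps.org XML format; the function name and the example.com addresses are hypothetical and stand in for the site’s real URLs.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls: list[str]) -> str:
    """Return a sitemap XML string with one <url><loc> entry per address."""
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for address in urls:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = address
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical addresses: an HTML page and a PDF are listed the same way.
sitemap_xml = build_sitemap([
    "https://www.example.com/guides/overview.html",
    "https://www.example.com/downloads/user-guide.pdf",
])
```

The PDF entry needs no special markup; it is listed exactly like an HTML page.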
Size limitation: as a general rule of thumb, it is advisable to create PDFs as small as possible, and to avoid sizes larger than 2.5 MB.
The larger the file, the longer SEs may take to crawl it and the less often they will do so; they may also crawl only parts of it, or avoid indexing it altogether if it is too big. Specifically for Google, PDFs are temporarily converted to HTML during the crawl, and Google will index a maximum of 2.5 MB of the temporary HTML file. If the temporary HTML is larger than 2.5 MB, Google will usually crawl the whole file but index only 2.5 MB of data (usually the first 2.5 MB). If the temporary HTML file exceeds 100 MB, Google might not index it at all.
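The size guidance above can be checked before publishing. The following is a minimal Python sketch that assumes the 2.5 MB and 100 MB figures quoted in this section; the function names and labels are hypothetical. It looks at the size of the PDF itself, which is only a rough proxy, since the limits described apply to the temporary HTML that Google generates.

```python
from pathlib import Path

# Thresholds taken from the figures quoted in this section (assumed values).
INDEX_LIMIT_BYTES = int(2.5 * 1024 * 1024)  # amount Google is said to index
HARD_LIMIT_BYTES = 100 * 1024 * 1024        # above this, indexing may be skipped


def classify_pdf_size(size_bytes: int) -> str:
    """Return a rough label for a PDF of the given size (hypothetical helper)."""
    if size_bytes <= INDEX_LIMIT_BYTES:
        return "ok"
    if size_bytes <= HARD_LIMIT_BYTES:
        return "partially indexed"
    return "may not be indexed"


def report(folder: str) -> dict[str, str]:
    """Map every PDF found under `folder` to its size label."""
    return {
        str(path): classify_pdf_size(path.stat().st_size)
        for path in Path(folder).rglob("*.pdf")
    }
```

Calling report() on the folder that holds the site’s PDFs gives a quick list of files that are worth shrinking or splitting.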
Influencing the title Google will use for the PDF in search results: for PDFs, it is not possible to direct SEs to use a specific title using meta tags (as in HTML). However, Google will usually choose the title it uses for the file from the main document heading (H1) and/or from the text used as the link to the PDF file, assuming they match the content of the PDF.
Title and Heading markup: Google crawls and indexes titles that are stylistically marked as titles (using Headings), and utilizes them to improve the indexing and association with keywords. Therefore, it is important to use headings markup for titles when creating PDFs.
Links within PDFs: as previously mentioned, Google can index links within PDFs, and treats them as it would links in HTML. For this purpose, links must have a standard link structure (i.e. structured as <a href="/page2.html">link to page 2</a>). As it is not possible to mark links in a PDF with the “no follow” and “no index” tags, a link that should not transfer authority must not be placed in the PDF at all.
Usage of Rich Media: Google will not index rich media (including pictures of any kind) placed in PDFs. It is necessary to avoid placing texts in images (same as in HTML pages). If a picture is to be indexed, it is possible to place a link to the picture in the PDF, and then the crawler will follow that link and index the picture (as a separate file from the PDF and not as part of its content).
PDF produced with text from scanned images of texts (OCR): as previously mentioned, SEs will not index text located in a picture. However, if the text was produced through OCR, it is still considered text, and there should be no problems with indexing.
Indexing PDFs but preventing displaying cached versions in Google: if the PDF contains temporary content, or content that changes often, it may be desirable to prevent Google from keeping and displaying cached versions of files that are outdated or no longer exist. This can be achieved by adding the X-Robots-Tag header with a “no archive” value to the HTTP response for the PDF (see section 1.4 for details).
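As an illustration of the “no archive” request described above, the following is a minimal Python sketch of a WSGI application that serves a PDF with the X-Robots-Tag header set. The application, the helper and the placeholder PDF bytes are hypothetical; a real site would normally set this header in its web server or framework configuration.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical PDF body; a real server would read the file from disk.
PDF_BYTES = b"%PDF-1.4\n%%EOF\n"


def pdf_app(environ, start_response):
    """Minimal WSGI app that serves a PDF and asks SEs not to cache it."""
    headers = [
        ("Content-Type", "application/pdf"),
        ("Content-Length", str(len(PDF_BYTES))),
        # "noarchive" asks search engines not to show a cached copy.
        ("X-Robots-Tag", "noarchive"),
    ]
    start_response("200 OK", headers)
    return [PDF_BYTES]


def call_app(app):
    """Invoke the app in-process and return (status, headers dict, body)."""
    environ = {}
    setup_testing_defaults(environ)
    captured = {}

    def start_response(status, response_headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(response_headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body
```

The same approach works for the other supported values: replacing "noarchive" with "noindex" turns the header into a request not to index the file.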
Avoid using password-protected PDFs: when creating a PDF, it is sometimes possible to add a password lock to it, to prevent unauthorized access to the file. Obviously, locking the file with a password will prevent SEs from accessing it, so if indexing is desired, password protection must not be used.
1.4. Preferring non-PDF content- recommendations
When choosing to place the authority in HTML pages (the recommended option) while still making use of PDFs on the site (for example, downloadable forms), it is advised to generally prevent PDFs from being indexed, thus preventing a leakage of authority from the site pages to files, and preventing users from landing directly on files.
Stopping PDFs from being indexed: it is possible to ask SEs not to index PDFs. There are 3 ways of doing it (a to c), described here in order of preference; item d covers removing a file that has already been indexed.
a) Blocking large numbers of files- asking for an entire folder not to be indexed: this is the most recommended and “cleanest” method. Create a separate folder on the server, and place all the PDF files in it. Next, in the site’s robots.txt file, exclude the entire folder with a Disallow rule, which asks SEs not to crawl it. The advantage of this method is that from this point on, any additional PDF file uploaded to the folder will also automatically be ignored by SEs. Additionally, this method is impervious to errors due to changes in the files or in links leading to the files.
b) Single file handling- asking for an individual file not to be indexed: if the aforementioned solution is not desired (too large a scale), it is possible to block a specific file with its own Disallow rule in the site’s robots.txt file.
c) Single file handling- marking the file itself as “no index”: it is not possible to use regular “no index” meta tags with PDFs, as they have no HTML head section. However, it is possible to mark the file itself by implementing the X-Robots-Tag header in the HTTP response for the file. The following is an example of the HTTP response for a PDF with an X-Robots-Tag header requesting “no index”:
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: noindex
(…)
The X-Robots-Tag header also supports the “no follow” and “no archive” requests.
For further details about the X-Robots-Tag header, please see the information supplied by Google at the following link: https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
d) Standard and urgent removal of a file from Google’s index: if a file has been indexed and we wish to remove it from the index, marking it with “no index” using any of the aforementioned methods (especially the X-Robots tag) will eventually lead to its removal from the index.
If there is an urgent need to remove a specific file (or a folder, or even an entire site) from the index quickly, it is possible to request the removal through Google Webmaster Tools (for Google’s index only), using the URL removal tool. It’s important to remember that this is a last resort: if the file has not been marked with “no index”, it will be crawled and indexed again!
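The folder-level and file-level blocking described in items a and b can be verified before deployment. The following is a minimal Python sketch using the standard library’s robots.txt parser, assuming the blocking is written as standard robots.txt Disallow rules; the folder name, file name and helper function are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: one folder-wide rule and one single-file rule.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pdf-downloads/
Disallow: /forms/application.pdf
"""


def is_blocked(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules in `robots_txt` disallow `url` for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)
```

With these rules, any PDF later uploaded to /pdf-downloads/ is covered automatically, while the single-file rule affects only /forms/application.pdf.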
1.5. Preventing content duplication
If, under any circumstances, there is a PDF file available for indexing and at the same time an HTML page with the same (or highly similar) content, or other PDF files with the same (or highly similar) content, it is necessary to specify the preferred version for SE in order to avoid content duplication penalties.
This can be achieved using the canonical tag (similar to HTML). However, it’s important to remember that the tag has to be implemented in the header of the PDF’s HTTP response. For further details on this subject, see the following link (and specifically- the example at the bottom of the page for implementing canonical in PDFs) https://support.google.com/webmasters/answer/139066?hl=en.
It’s important to remember that such a canonical markup will only work if the PDF is available for indexing- otherwise, the SE will never see the canonical request.
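For illustration, a canonical request in an HTTP response takes the form of a Link header with rel="canonical". The following minimal Python sketch builds that header; the function name and the example.com address are hypothetical, and how the header is attached to the response depends on the web server or framework in use.

```python
def canonical_link_header(canonical_url: str) -> tuple[str, str]:
    """Build the HTTP Link header that declares `canonical_url` as preferred."""
    # Reject characters that would break the header syntax.
    if any(ch in canonical_url for ch in '<>"\r\n'):
        raise ValueError("URL contains characters that are unsafe in a header")
    return ("Link", f'<{canonical_url}>; rel="canonical"')


# Hypothetical use: the PDF response points SEs to the HTML version.
header = canonical_link_header("https://www.example.com/guides/white-paper.html")
```

The resulting name and value pair is sent with the PDF’s HTTP response, alongside headers such as Content-Type.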
1.6. Links for additional information
General information about Google and PDFs (from Google webmasters blog)
X-Robots tag (Google) https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
Canonical tag, including an example of implementing in PDF’s HTTP response https://support.google.com/webmasters/answer/139066?hl=en
Canonical tag- general information https://support.google.com/webmasters/answer/139394?hl=en
2. PDF and Accessibility considerations
*All information in this chapter follows the conventions of the WCAG 2.0 (http://www.w3c.org.il/guidelines/guidelines_WCAG_2.0.html) international regulations for accessibility of internet content, to an AA standard level. Be advised that some local laws and regulations may differ from this standard, and that this document is in no way a replacement for legal advice on the subject, nor does it advise on local regulations.
PDF is a format that allows for very high level of accessibility when the file is appropriately constructed.
Constructing the PDF appropriately involves 2 main parts: adjustments made while preparing the original document (e.g. Word, RTF etc.), and adjustments made to the PDF itself, which in turn may be divided into enabling accessibility options and performing adjustments to content. We present here some of the important aspects to consider. For further details and specific implementation techniques, please see the provided links.
2.1. Preparing the original document
Preparing the original document as an accessible document is the basis for creating an accessible PDF. As there are countless document formats that may serve as the original format, we will supply here only the main points for consideration in the Word format, which is the most common one. However, it is important to remember that the following list is only a summary of the main subjects, and the author of the document must confirm that the document meets all the requirements of the WCAG 2.0 regulations (http://www.w3c.org.il/guidelines/guidelines_WCAG_2.0.html).
Using live text only (all standard levels): avoid placing text in images, or creating documents from scanned images of texts without OCR.
Defining document structure and design, headings and structural hierarchy with style definitions (all standard levels): all the design of the document and of document items must be done using the built-in Word style definitions, and not manually. This includes numbering and, most importantly, titles (Headings). For example, do not select a line of text and manually apply bold, underline and a large text size to give it the appearance of a title. Instead, mark it with a Heading style according to the desired hierarchy (H1, H2 etc.), and then manually adjust the visual appearance.
It’s important to realize that this subject goes beyond visual appearance: using the built-in style definitions creates the structure and hierarchy definitions of the document, on which most accessibility instruments rely for their functioning.
Creating spaces using style definitions and not manually (level A): for the same reasons mentioned above, it is crucial that all spacing (between lines, between words, between paragraphs etc.) is defined using the built-in Word style options, and not manually (i.e. not using the space bar, tab key etc.).
Constructing tables using the Word built in table options, and not manually or using a picture (level A)
Supplying alt tags for images (level A)- crucial
Links from texts (level A): the words used in creating links must be meaningful. Avoid generic phrases such as “click here”, “for more information” etc.
In addition, use Word’s built-in “ScreenTip” tool to supply an explanation or description for the link.
Supply explanations for all abbreviations used in the document, for example: SE = Search Engine.
Contrasts and color coded information:
a) Make sure to use appropriate contrast definitions: at least 4.5 to 1 for the contrast between the text and background (level AA), or 3 to 1 for large text (18 points or larger, or 14 points or larger if bold). In addition, 3 to 1 for contrast between adjacent texts (level AA).
b) Avoid using color coding as the only way to convey information (level A)
c) Verify color compatibility for color blind individuals- see techniques and regulations supplied in WCAG 2.0 (accessibility level- determined according to the techniques that will be employed)
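The contrast requirement in item a can be checked numerically. The following minimal Python sketch follows the relative luminance and contrast ratio definitions given in WCAG 2.0; the function names are hypothetical, and colors are assumed to be 8-bit sRGB triples.

```python
def _linearize(channel: int) -> float:
    """Convert one 8-bit sRGB channel to its linear value (WCAG 2.0 formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """Relative luminance of an sRGB color, from 0 (black) to 1 (white)."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    """Contrast ratio between two colors, from 1 (identical) to 21."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(fg, bg, large_text: bool = False) -> bool:
    """Check the level AA threshold: 4.5 to 1, or 3 to 1 for large text."""
    return contrast_ratio(fg, bg) >= (3.0 if large_text else 4.5)
```

For example, passes_aa((0, 0, 0), (255, 255, 255)) checks black text on a white background, the highest possible contrast.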
2.2. Handling the PDF- enabling accessibility options
For a PDF to be available for use in various accessibility aids, accessibility options must be enabled during the conversion of the original document to the PDF format.
There are many PDF conversion tools on the market. We will supply here only the options for Adobe Acrobat, which is the most common one. Please note that some cheap or free tools don’t include the accessibility options at all, and therefore should not be used.
During conversion, in the “preferences” window of Adobe Acrobat, under the “settings” tab, mark these 3 options (see screenshot below): “create bookmarks”, “add links” and “enable accessibility and reflow with tagged Adobe PDF”.
Screenshot: the preferences window of adobe acrobat for document conversion
Note these 3 options, correctly selected to enable accessibility during conversion
2.3. Performing adjustments and marking tags in the new PDF
After performing the previous 2 stages, we receive a PDF which complies with all basic accessibility regulations. At this point, the author should verify that the document complies with all the relevant regulations in the WCAG 2.0 (http://www.w3c.org.il/guidelines/guidelines_WCAG_2.0.html). To verify this, there are 23 technical points that must be observed. These points and the techniques to comply with them are explained in a separate page in the WCAG 2.0 dedicated to techniques for PDFs: http://www.w3.org/TR/WCAG20-TECHS/pdf.html .
If the 1st stage (preparing the original document) was thoroughly performed in accordance with all WCAG 2.0 regulations, then there will be very little work left to be done at this stage, most of it relevant to handling forms that need to be filled by the reader.