PDFs and Search Engines

While we were doing the Flash tests I got to thinking about the other binary blobs out there and how the search engines were coping with them. While Word docs are perhaps the most common way for people to send text data around, Word is used less as the end format. Currently the format of choice is Adobe PDF, particularly in the corporate world. So it was time for another round of tests: do PDFs rank for keywords? Will their links pass weight? More importantly, why are they so annoying?

How I conducted the tests

Like the previous Flash test, I used 4 separate domains: 2 are "authority" domains in day-to-day use, while 2 are my little test areas. Just over 16 PDFs were used, with 6 end pages.

Results for those who can't wait; commentary is below.

Test             Google  Yahoo  MSN/Live  Ask
Indexed          Yes     Yes    Yes       Yes
File Name        Yes     Yes    Yes       Yes
Term Rank        Yes     Yes    Yes       Yes
Links Indexed    Yes     Yes    Yes/No    No
Links Weighted   Yes     Maybe  Maybe     N/A
Duplicate HTML   Yes     Yes    Yes       Yes
Duplicate PDFs   No      Yes    Yes       Yes

PDFs Ranking, Indexing, and optimising

Given we have all seen PDF files in the SERPs, it will come as no surprise that all the major search engines do indeed index and rank PDFs. None of the major search engines gave the PDFs high rankings if HTML pages were present in the SERPs to compete with them, but this is not surprising. Attempts to optimise a PDF are still limited to keyword repetition and basic manipulation of some of the PDF's metadata.

PDF and search friendly Links

Barring anecdotal evidence, I have not until now seen research to support the claim that PDF links carry weight. Our tests indicate that PDF links are followed, crawled and indexed, though as PDFs themselves carry little weight there is little to pass on to their links. Ask was the only search engine which did not index our test links; however, once more this may be a timing issue, and so yet again Ask is in the sin bin. The rankings in Yahoo and MSN/Live were so poor for our test links that it was hard to tell if they were gaining any weight from their PDF links.

Anchor Text did make a small difference in Google and Yahoo, but once again it was hard to tell in MSN/Live.

PDFs and Duplicate content

One of the factors noted in our Flash testing was that copy held inside the SWF file could be a duplicate of an HTML page, and both would be crawled and ranked without a penalty in all of the major engines. This also appears to be the case with PDFs. All the major engines indexed and ranked scraped content held in a PDF, though no PDF outranked its HTML equivalent, even with the same inbound links. Of the engines, only Google filtered out duplicate PDFs, and it only discounted PDFs which were identical: change the title or meta description, or change the file size by adding any extra code, and both versions would be indexed.
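How Google actually spots duplicate PDFs is not something these tests can tell us. The behaviour is at least consistent with an exact-match comparison, so here is a minimal sketch of that idea, assuming a simple hash over the raw file bytes. The function names and the stand-in byte strings are hypothetical, and nothing here claims to be Google's method.

```python
import hashlib


def fingerprint(pdf_bytes: bytes) -> str:
    """Return a SHA-256 hex digest of the raw file bytes."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def is_exact_duplicate(a: bytes, b: bytes) -> bool:
    """True only when the two files are byte-for-byte identical."""
    return fingerprint(a) == fingerprint(b)


# Stand-in byte strings for illustration, not real PDF files.
original = b"%PDF-1.4\n(scraped article text)\n%%EOF\n"
padded = original + b"% extra comment\n"

print(is_exact_duplicate(original, original))  # True
print(is_exact_duplicate(original, padded))    # False
```

Under this assumption, any change to the bytes, however small, produces a different fingerprint, which would explain why padding the file was enough to get both versions indexed.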

What does this mean for me?

Anyone who has searched for hardware manuals will know that PDFs are often found for long tail search terms, and while it is unlikely a PDF could compete for competitive search terms, they carry enough weight to be of some use in a search engine optimisation campaign.

Dofollow only

Unless Google petitions Adobe, a link is a do-follow link and that's it: there is no way for a writer to give a do-not-crawl indication except for the entire document. This of course has implications for those looking to circumvent paid link issues; a PDF version of your review, for example, would give your links more (though not a lot more) weight than a no-followed link. Since you have no way to separate your editorial and non-editorial links, Google can't penalise you without treating all links from a PDF as no-followed.
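Document-level control of this kind is normally applied by the web server rather than inside the PDF itself. A minimal sketch, assuming your server gives you a hook for adding response headers: the helper name and the blanket policy are hypothetical. Google documents an X-Robots-Tag response header for non-HTML files; whether the other engines honour it was not part of these tests.

```python
def robots_headers(path: str) -> dict:
    """Return extra response headers for a requested path (hypothetical helper)."""
    if path.lower().endswith(".pdf"):
        # Applies to the whole document: every link inside it is affected,
        # there is no per-link equivalent of rel="nofollow".
        return {"X-Robots-Tag": "nofollow"}
    return {}


print(robots_headers("/reviews/widget.pdf"))
print(robots_headers("/index.html"))
```

Note that this is all or nothing, which is exactly the limitation described above.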

SPDFs or Spam PDFs

Splogs are increasingly becoming accepted; most bloggers accept that their site is going to be scraped, and many are looking at ways to maximise this unauthorised syndication. It's generally accepted that such scraping is not likely to cause the original site any major problems. Scrapers using PDFs might, however, prove more problematic. First off, the PDFs are much more likely to rank for some terms than a traditional splog. Because PDFs take time to rank, you are also much less likely to notice an SPDF: by the time it is ranking you will have forgotten about your post and its rankings. SPDFs perhaps represent more of a reputation management issue than a splog. Apart from the general annoyance of PDFs, the ways to monetise them are limited, and so people may go to more unusual lengths (PDF viruses are not unheard of).

PDF usability on the web sucks

PDFs may be the file format of choice for ebooks and secure documents, but when it comes to viewing them in a browser, for most people they just plain suck! I know that when I accidentally click a link to a PDF my first reaction is to frantically click the close icon in Firefox! Before someone suggests Foxit, I've done it for you, and I recommend it for anyone currently using Adobe Reader. Consequently, anyone seriously considering using PDFs as part of a strategy might want to reconsider using them as a primary source. Given that the search engines are relaxed about duplicate content in PDFs, it makes sense to include, where possible, an HTML version of your PDF, which should rank higher than the PDF.

Deep linking in PDFs

OK, so that is a misleading heading, but here is a useful tip in case you didn't know it: you can jump to a specific page in a PDF by adding a #page= fragment to the end of the URL.

www.example.com/myexample.pdf#page=10

Like any anchor link it is treated as being the same page by the search engines.
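If you generate these links in bulk, the fragment handling can be sketched as below. This is a minimal sketch using Python's standard library; the helper name is hypothetical, and the http:// scheme is an assumption added to the example URL above. Stripping the fragment shows the underlying URL that, as noted, the search engines treat as the same page.

```python
from urllib.parse import urldefrag


def pdf_page_link(pdf_url: str, page: int) -> str:
    """Append a #page=N fragment to a PDF URL (hypothetical helper)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    base, _ = urldefrag(pdf_url)  # drop any fragment already on the URL
    return f"{base}#page={page}"


link = pdf_page_link("http://www.example.com/myexample.pdf", 10)
print(link)                 # http://www.example.com/myexample.pdf#page=10
print(urldefrag(link).url)  # http://www.example.com/myexample.pdf
```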

The long wait

One of the biggest issues with PDFs seems to be the time they take to rank, which seems to be between 1 and 2 weeks on an authority domain in Google and anywhere up to 6 weeks elsewhere. If you are hoping to make a quick buck, you might want to plan ahead.

So is PDF SEO worth it?

It's a lot of work for little gain. I would not be surprised if some clever spammer developed a method to turn their daily scrapings into PDFs; with enough interlinked PDFs they could provide some weight to pages. Personally I am not a fan. I think unless you have a reason to be protecting your content (in which case, should it be available to a search engine?) there is little point to PDFs on the web; HTML does a much more efficient job without upsetting your users. However, those using PDFs can be safe in the knowledge that their PDFs are helping their rankings and that duplicate content should not be an issue.

Do you use PDFs? I would like to hear and gather people's thoughts.


18 Comments »

Comment by Ashish Mohta from Technospot.Net
2008-02-19 14:22:39

Pretty interesting study. I had always wondered about the indexing of PDFs many a time, but this gives a better understanding. In fact it's not just PDF, it's everything. The only difference is you can't see a PDF in one shot to find duplicate content. I hope Google does something, or else its algorithm will fail.

 
Comment by reena from Investintech.com, Inc.
2008-02-19 17:03:57

Thanks for the informative post! I was looking for research like this a few months ago for a posting on search engines and PDFs. I was wondering if there are any statistics or research on the type of PDF content users are currently searching for? Is it at all possible to single out search habits for PDF files online?

Best,
reena

 
Comment by Charles Benninghoff from Crown SEO - Our search engine optimization will ta
2008-02-19 18:09:59

Statistically, some data is available. For example, a search from Google’s advanced search facility results in only 114 million returns for web sources ending in .pdf.

Thus, while there may be billions of pdf pages out there, Google is clearly not indexing a mere fraction of them.

The reason for this is clear: robots.txt limitations.

Again, the power of the robots.txt is apparent and is, perhaps, one of the elements in Google’s algorithm that is more important than generally realized.

Hope this helps.

Charles

 
Comment by Catfish from BusinessOnLine SEO Blog
2008-02-20 00:38:19

Thanks very much for this post. This is the best article I have ever read on the subject. Seriously, thanks man.

 
Comment by Gab Goldenberg from SEO ROI Services
2008-02-20 04:37:10

Tim, I really love all these experiments you run!

As to dupe content, the engines rank it in html format (read the news on any given day), so why not in pdf?

 
Comment by James from Bugs And Weeds
2008-02-20 04:51:56

Tim, I have to say that I have been reading a lot of seo articles over the last couple of years, but I found more practical, interesting information here than in any 10 other sites I have read. I love the approach! Great work, very helpful!

 
Comment by Chris "PDF's are SEO tools"? Estes from Search Engine Optimzation by Chris
2008-02-20 07:19:15

Interesting results. I don’t like the future of PDF SEO, but adding a few search engine friendly attributes is easy. All of the PDFs I share are optimized for fast web view and edited for SERPs. One thing I haven’t found is links showing up in a link:www.example search. If you run more tests you might want to monitor links from PDF documents.

 
Comment by Tim Nash
2008-02-20 12:50:27

@Ashish, not sure what you mean; Google could simply convert the PDF to text and compare. It is certainly capable of doing it all in one go, though it might be a bit heavy on the server load.

@Reena, I have not seen research on searching habits by document type, but it would be interesting to see. I suspect that if people are anything like me, you would see a very low number of people clicking through to PDFs unless they had to.

@Charles once again your genius astounds me :)

@Gab, we sort of covered that one, but yes, the duplicate content issue is here to stay in both HTML and PDF. You have to ask yourself how hard it is to compare two lots of text. From a simple server load point of view it's actually quite problematic, but hey!

@Chris, we had links ranking and appearing as backlinks, along with pages indexing that were only linked to by the PDFs :) though it does require a few nudges at the PDF with lots of HTML links to start the ball rolling.

@Everyone, thanks for the comments and feedback. These sorts of articles take a fair amount of time to research and it’s often hard to know how people will take to them, so it’s always nice to receive positive feedback.

 
Comment by Edward Beckett from Florida Search Engine Optimization LLC
2008-02-22 08:46:09

Great Study …

A few months ago I was working on an SEO campaign for several sites when the company editor asked me if I recommended hosting some white papers on the site(s) in PDF …

I suggested that we write them in HTML and then create a break out window for those that wish to download the paper …

It worked out really well that way …

Thanks for the great study … now I feel that I made a smart decision …

 
Comment by Marios Alexandrou from http://www.allthingssem.com/
2008-02-24 22:53:12

What I wish for is an effective way to convert PDFs into HTML where images retain their quality, layout remains intact, and multi-column pages get converted to single-column pages. I’m not asking for too much, am I? :-)

 
Comment by Ilia B from Exposed SEO
2008-03-02 20:16:31

Rumour has it there is a PDF search engine in development.

Comment by reena from Investintech.com, Inc.
2008-03-03 21:05:15

Hi Ilia,

A PDF search engine (ie. a web search engine for PDFs)? That’s really interesting. Where did you hear the buzz for that one? Sounds like something worth looking into ;)

Best,
reena

 
 
Comment by Ilia B from Exposed SEO
2008-03-04 01:24:52

Yep, a friend of mine works for Symantec and one of his workmates was asked to help out with the startup. Not sure how reliable this information is though.

 
Comment by Scott from Paintworkzweb Design
2008-04-14 10:17:53

Would the search engines be able to index the images found in PDFs? That would be pretty neat. But would it be possible?

Comment by Tim Nash
2008-04-14 10:22:28

It would be neat but not very practical. Even Adobe Acrobat has problems pulling images from PDFs at times, so Google has even less chance. The biggest problem, though, is that there is no real way to classify them: with no alt tags and no file names, they could only go on the content around the image, which just would not be accurate.

 
 
