“I wonder how easy it would be to write a script to calculate the link density in a block and use that to identify eg whether it’s an internal navigation block, blogroll etc?”
Bill’s article covers a pair of research papers by Microsoft. The basic premise is that a page is not always about a single topic but is rather a group of topics; by splitting the page into individual blocks and treating these blocks separately, the Microsoft team could do whatever a search engine does with each block.
Their approach was a combination of techniques, but in a nutshell they first identified larger areas of the page through the DOM and then used a technique called VIPS (Vision-based Page Segmentation) to work out visual divides in content. This is rather clever but also a little over the top. In the first paper they give their reasons for not going entirely DOM-based, citing some of its limitations.
“First, DOM is still a linear structure, so visually adjacent blocks may be far from each other in the structure and departed wrongly. Secondly, tags such as <TABLE> and <P> are used not only for content presentation but also for layout structuring. It is therefore difficult to obtain the appropriate segmentation granularity. Thirdly, in many cases DOM prefers more on presentation to content and therefore not accurate enough to discriminate different semantic blocks in a web page.”
While all of these are true, one could certainly argue that their first objection is a moot point, simply because many web pages render differently in different browsers, and indeed some (including Google’s own pages) use image substitution techniques or similar to hide contextual information behind fancy presentation. Indeed, with very little thought both the second and third limitations can easily be overcome. While I am not suggesting that relying on the DOM as your sole source will provide results as good as the papers’ combined method, I do not think they would be wildly different. Where the DOM method does come into its own is in processing and coding time, making it far more practical.
Using the DOM to identify blocks
Let’s start by making a couple of assumptions:
- The sites we are evaluating are in English
- Tables are being used to store tabular data
Assumption 2 is simply there to reduce our workload; it also means that sites will most likely be following pseudo-semantic layouts.
If we presume a block of content is contained in a container, then we should have an easier time locating and separating blocks; this is very much the Russian doll idea. We start with a global <html>, which we will take as a given. Next we have <head> and <body>; for this example we shall discard the head and its sub-containers. Moving on, inside the body we have a range of potential containers, the most common of which are div, span, ul, ol, table and dl (note I have not included headings in this list). Of course, each of these containers could be there for structure, semantics or presentation.
Because of the ambiguity of the P tag and its common use in simple presentation it does not make sense to include it as a container block even though it is obviously one.
Our first step therefore is to identify all potential container tags, most of which will be nested inside other tags. Our next step is to eliminate tags which contain only tags that are identical, and tags which contain no other element or content. For example:
Would result in test and 1 being removed
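A minimal Python sketch of this first rule, assuming the markup is well-formed enough for the standard library's `xml.etree.ElementTree` to parse (real pages would need a more forgiving HTML parser). It reads "identical" as "the same tag as the container itself", which is only one possible interpretation. The names `CONTAINERS` and `candidate_blocks`, and the sample markup, are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Container tags listed earlier in the post; <p> and headings are left out on purpose.
CONTAINERS = {"div", "span", "ul", "ol", "table", "dl"}


def has_direct_text(el):
    """True if the element holds text of its own, not counting text inside children."""
    if el.text and el.text.strip():
        return True
    return any(child.tail and child.tail.strip() for child in el)


def is_empty(el):
    """A container with no child elements and no text."""
    return len(el) == 0 and not has_direct_text(el)


def is_pure_wrapper(el):
    """A container whose children are all the same tag as itself (assumed reading)."""
    return (
        len(el) > 0
        and not has_direct_text(el)
        and all(child.tag == el.tag for child in el)
    )


def candidate_blocks(root):
    """Return the ids of containers that survive rule 1, in document order."""
    kept = []
    for el in root.iter():
        if el.tag not in CONTAINERS:
            continue
        if is_empty(el) or is_pure_wrapper(el):
            continue
        kept.append(el.get("id"))
    return kept


# Hypothetical sample: an outer div that only wraps other divs, one of them empty.
sample = ET.fromstring(
    '<body><div id="test"><div id="1"></div><div id="2">mushrooms</div></div></body>'
)
print(candidate_blocks(sample))  # ['2']
```

With this reading a `ul` that contains only `li` elements is kept, which is what we want for menus and link lists.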
Next, a leap of semantic faith: for every heading tag which is directly followed by a container or by text, extract the contents up to the point where that container closes or, in the case of text outside a container, up to the next container or heading, whichever comes first. This does create a small problem, for example:
<h1>mysite</h1> <div id="test"> <div id="1"></div> <div id="2">mushrooms</div> <div id="3"> <h2>mushrooms</h2> this is a mushroom <h3>mushroom breeds</h3> <p>A mushroom is a type of</p><p>fungus</p> </div> </div>
The problem here is that under rule two the separation of the word fungus from the rest of the section means it is not counted as part of the mushroom breeds block. We can take solace in the fact that this would almost certainly have been done for presentation reasons, and so the separation is likely to occur with VIPS as well.
A third rule is needed to deal with non-semantic layouts. A common and unhelpful technique is to place a heading tag on its own inside a div; in such cases we should treat the encapsulating div as non-existent, meaning that its parent div, if previously affected by rule 1, would no longer be so affected.
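The heading rule can be sketched in Python along the following lines. This is a simplified reading rather than a finished implementation: it assumes well-formed markup, only looks at the direct children of one container, and treats any element after the heading as "the container". The function name `heading_blocks` is invented for the example.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def heading_blocks(container):
    """Pair each heading in a container with the content that directly follows it."""
    blocks = []
    children = list(container)
    for index, child in enumerate(children):
        if child.tag not in HEADINGS:
            continue
        title = "".join(child.itertext()).strip()
        following_text = (child.tail or "").strip()
        if following_text:
            # Loose text after the heading: it runs until the next element starts.
            body = following_text
        elif index + 1 < len(children) and children[index + 1].tag not in HEADINGS:
            # An element follows directly: take its contents up to its closing tag.
            body = "".join(children[index + 1].itertext()).strip()
        else:
            body = ""
        blocks.append((title, body))
    return blocks


# The inner div from the example above.
section = ET.fromstring(
    '<div id="3"><h2>mushrooms</h2> this is a mushroom '
    "<h3>mushroom breeds</h3><p>A mushroom is a type of</p><p>fungus</p></div>"
)
for title, body in heading_blocks(section):
    print(title, "->", body)
```

Running it shows the problem described above: the second paragraph, "fungus", is not attached to the "mushroom breeds" block.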
<div id="1"><h2>Mushroom species</h2></div>
<div id="2">mushrooms come in all shapes and sizes</div>
Would result in the ID test being a block as ID 1 contains only a heading tag and no further content
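One way to express this third rule in Python is to unwrap heading-only divs before applying rule 1. The sketch below assumes well-formed markup and an enclosing div with the ID test, as the sentence above implies; `is_heading_wrapper` and `unwrap_heading_divs` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def is_heading_wrapper(el):
    """True for a div whose only content is a single heading element."""
    if el.tag != "div" or len(el) != 1 or el[0].tag not in HEADINGS:
        return False
    stray_text = (el.text or "").strip() or (el[0].tail or "").strip()
    return not stray_text


def unwrap_heading_divs(parent):
    """Replace every heading-only div with the heading itself, in place."""
    for index, child in enumerate(list(parent)):
        unwrap_heading_divs(child)
        if is_heading_wrapper(child):
            heading = child[0]
            heading.tail = child.tail  # keep any text that followed the div
            parent.remove(child)
            parent.insert(index, heading)


# Assumed outer div "test" around the two divs shown above.
block = ET.fromstring(
    '<div id="test">'
    '<div id="1"><h2>Mushroom species</h2></div>'
    '<div id="2">mushrooms come in all shapes and sizes</div>'
    "</div>"
)
unwrap_heading_divs(block)
print([child.tag for child in block])  # ['h2', 'div']
```

After unwrapping, the div with the ID test no longer contains only divs, so rule 1 would leave it alone and it becomes a block.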
So let’s take a fairly common layout and see if with just 2 rule sets we can break our content down. To make things fair we will also do a visual breakdown to see if they match. Our site of choice is this one’s front page, full of content and with a fairly complex layout:
So from the DOM and applying the above rules we are left with…
- Div – Header
- Div – Menu (contains purely UL no ID)
- Div – title (images only)
Then for each blog post entry
- Div – Post-ID#
Beyond the blog entry divs
- Div – Sidebar left
- UL – No ID
- H3 – What is IA
- H3 – What is SEO
- H3 – Who is Tim Nash
And so on. One nice thing to note: the hCard microformat works nicely in this scenario.
The rest again follow the rules nicely, with the exception of the blip inside the bottom-right div, which has two heading tags one immediately after the other and so would generate two blocks rather than one.
Analysing for number of links
Our block analysis is not overly complicated and is not 100% accurate, but it doesn’t have to be. Let’s go back to Tony’s initial idea: while once again you could use the papers’ link analysis, I think a simple scoring system would work with comparable results.
Determine link count
This is pretty simple: count the number of links in a block. To determine whether the block is a “link block” we could take several approaches.
I say it is approach
In this approach, once a block contains n links it is deemed to contain enough links to be a link block, regardless of other content.
Content to link ratio
If the number of words inside ‘a href’ tags (<a href>the words counted here</a>) exceeds the number of words outside them (or some other ratio), the block is considered to be a link block.
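Both of these approaches need little more than a link count and two word counts. Here is a rough Python sketch using the standard library's `html.parser`; the class and function names are invented for illustration, a "word" is simply a run of word characters, and the default of n = 3 is an arbitrary choice.

```python
import re
from html.parser import HTMLParser


class LinkStats(HTMLParser):
    """Count links, words inside <a href> tags and words outside them."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.link_words = 0
        self.other_words = 0
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and "href" in dict(attrs):
            self.links += 1
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        words = len(re.findall(r"\w+", data))
        if self._in_link:
            self.link_words += words
        else:
            self.other_words += words


def link_stats(html):
    parser = LinkStats()
    parser.feed(html)
    parser.close()
    return parser


def is_link_block_by_count(html, n=3):
    """The 'I say it is' approach: n links are enough, whatever else is there."""
    return link_stats(html).links >= n


def is_link_block_by_ratio(html):
    """Content to link ratio: more words inside links than outside them."""
    stats = link_stats(html)
    return stats.link_words > stats.other_words


snippet = '<p>Read <a href="/a">my first post</a> and <a href="/b">the second one</a> today</p>'
print(is_link_block_by_count(snippet), is_link_block_by_ratio(snippet))
```

The snippet has only two links, so the count test says no, while the ratio test says yes because six of its nine words sit inside links.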
If the block contains list elements which contain nothing other than an ‘a href’ tag, or which have a high link-to-content ratio, then it is considered a link block.
I prefer this approach myself, so if a block contains more than 3 links, test to see if:
- Li tag contains word content beyond ‘a href’
- ‘a href’ content of any li tag is less than half the total word count
- The total ‘a href’ content count is less than a third of the block
Let’s take a simple scenario:
<ul>
<li><a href="http://ventureskills.wordpress.com/2007/09/19/stumbleupon-mathematics-for-stumblers/">Mathematics for stumblers</a> - Venture Skills Blog</li>
<li><a href="http://internetducttape.com/2007/04/10/an-introduction-to-reputation-management/">Reputation management intro</a> - Engtech</li>
<li><a href="http://seocog.blogspot.com/2007/09/psst-want-to-buy-link.html">Psst Want to buy a link</a> - SEOCog</li>
</ul>
So in this block the number of links is 3. Each ‘a href’ is within a li tag but with additional content; however, the total word count inside the ‘a href’ tags exceeds the external content by more than a third.
Consequently this would be flagged as a link block.
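The list-based test can be sketched in Python as follows. Several things here are my own reading rather than fixed rules: a block is flagged unless all three checks come out in favour of ordinary content, the "half" check is applied per li, and the link threshold is "at least 3" so that the three-link scenario above qualifies. `ListLinkStats` and `is_list_link_block` are made-up names.

```python
import re
from html.parser import HTMLParser


class ListLinkStats(HTMLParser):
    """Collect [link_words, other_words] for every <li>, plus a link count."""

    def __init__(self):
        super().__init__()
        self.items = []  # one [link_words, other_words] pair per <li>
        self.links = 0
        self._in_link = False
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.items.append([0, 0])
            self._in_item = True
        elif tag == "a" and "href" in dict(attrs):
            self.links += 1
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == "li":
            self._in_item = False

    def handle_data(self, data):
        if self._in_item:
            words = len(re.findall(r"\w+", data))
            self.items[-1][0 if self._in_link else 1] += words


def is_list_link_block(html, min_links=3):
    stats = ListLinkStats()
    stats.feed(html)
    stats.close()
    if stats.links < min_links or not stats.items:
        return False
    link_words = sum(link for link, _ in stats.items)
    total_words = link_words + sum(other for _, other in stats.items)
    # The three checks from the list above; all must hold for ordinary content.
    has_extra_text = all(other > 0 for _, other in stats.items)
    links_under_half = all(link * 2 < link + other for link, other in stats.items)
    links_under_third = link_words * 3 < total_words
    return not (has_extra_text and links_under_half and links_under_third)


example = """<ul>
<li><a href="http://ventureskills.wordpress.com/2007/09/19/stumbleupon-mathematics-for-stumblers/">Mathematics for stumblers</a> - Venture Skills Blog</li>
<li><a href="http://internetducttape.com/2007/04/10/an-introduction-to-reputation-management/">Reputation management intro</a> - Engtech</li>
<li><a href="http://seocog.blogspot.com/2007/09/psst-want-to-buy-link.html">Psst Want to buy a link</a> - SEOCog</li>
</ul>"""
print(is_list_link_block(example))  # True
```

For the scenario above, 12 of the 17 words sit inside links, so the "less than a third" check fails and the block is flagged.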
Determining if it’s a blog roll
OK, we could make a huge cheat here and first see whether any link block has a heading containing “Blogroll” or similar wording, such as advertising. Alternatively we can analyse the links. Obviously a blog roll is made up primarily of other people’s links, so we would first need to determine whether the link block contains domains other than our own; 50% would be a reasonable limit, so if you have 4 links within a link block, at least 2 must be foreign domains. Next we would wish to exclude any links which have the attribute rel="tag", as this would indicate a service like Technorati; similarly rel="license" is a good indicator that the link is not from a blog roll. We can also use the reverse: rel="friend" gives us a high probability of that link being part of a blog roll.
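As a rough Python illustration of the foreign-domain check: links are passed in as (href, rel) pairs, links marked rel="tag" or rel="license" are dropped first, and the rest are compared against our own hostname. The function names, the exact-hostname comparison and the example.com addresses are all assumptions made for the sketch.

```python
from urllib.parse import urlparse

EXCLUDED_RELS = {"tag", "license"}


def foreign_share(links, own_domain):
    """Fraction of the remaining links that point at a hostname other than our own.

    links is a list of (href, rel) tuples; rel may be an empty string.
    """
    kept = [
        href
        for href, rel in links
        if not (EXCLUDED_RELS & set(rel.lower().split()))
    ]
    if not kept:
        return 0.0
    foreign = 0
    for href in kept:
        host = urlparse(href).hostname
        # Relative links have no hostname and count as our own.
        if host is not None and host != own_domain:
            foreign += 1
    return foreign / len(kept)


def passes_foreign_test(links, own_domain, limit=0.5):
    return foreign_share(links, own_domain) >= limit


links = [
    ("http://example.com/about", ""),
    ("/contact", ""),
    ("http://a.example.org/", ""),
    ("http://b.example.net/", "friend"),
    ("http://technorati.com/tag/seo", "tag"),
]
print(passes_foreign_test(links, "example.com"))
```

Here the Technorati link is ignored, leaving 4 links of which 2 are foreign, which just meets the 50% limit. A real version would probably want to treat www and other subdomains of our own site as "own".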
Basic Blog Roll scoring
So let’s create a simple scoring system for link blocks:
- For each new domain +2
- For each link that contains rel="friend" +2
- For each link to your own domain -1
- For each link to an external domain containing rel="tag" or rel="license" -2
Setting the blog roll likelihood threshold at 5
So let’s take a look at my original link block:
3 domains, none marked with a rel attribute, for a total of 6, making it likely to be a blog roll.
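The scoring rules translate almost directly into Python. In this sketch the rules are applied literally and independently (so an external rel="tag" link on a new domain scores +2 and then -2), "new domain" is taken to mean a hostname not yet seen in the block, and example.com stands in for our own domain; all of these are assumptions.

```python
from urllib.parse import urlparse

BLOGROLL_THRESHOLD = 5


def blogroll_score(links, own_domain):
    """Score a link block; links is a list of (href, rel) tuples."""
    score = 0
    seen_hosts = set()
    for href, rel in links:
        host = urlparse(href).hostname
        rels = set(rel.lower().split())
        if host is None or host == own_domain:
            score -= 1  # link to our own domain
            continue
        if host not in seen_hosts:
            seen_hosts.add(host)
            score += 2  # new external domain
        if "friend" in rels:
            score += 2
        if rels & {"tag", "license"}:
            score -= 2
    return score


original_block = [
    ("http://ventureskills.wordpress.com/2007/09/19/stumbleupon-mathematics-for-stumblers/", ""),
    ("http://internetducttape.com/2007/04/10/an-introduction-to-reputation-management/", ""),
    ("http://seocog.blogspot.com/2007/09/psst-want-to-buy-link.html", ""),
]
score = blogroll_score(original_block, "example.com")
print(score, score >= BLOGROLL_THRESHOLD)  # 6 True
```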
How to actually program it?
I have tried to keep all the rules fairly simple, so a would-be programmer could easily (hmm) knock this up with a little wizardry and some basic knowledge of XPath. Indeed the whole program would consist of just three elements: an XPath implementation, a word count and a string expander. Simple really.
The first and obvious use is to analyse pages you are hoping to gain a link from, to maximise your link’s potential. Indeed, really smart so-and-sos might even go as far as to develop this into a tool to create a much more accurate pricing system for links, similar to that of Text Link Ads. Of course they could also use it to identify blog rolls. What uses can you think of for such a script?
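To give a flavour of the XPath and word count parts, here is a small Python sketch using the limited XPath support in the standard library's `xml.etree.ElementTree`. It assumes well-formed markup and blocks that are divs with an ID; the page fragment and the `link_density` helper are invented for the example.

```python
import xml.etree.ElementTree as ET


def link_density(block):
    """Return (number of links, share of the block's words that sit inside links)."""
    links = block.findall(".//a[@href]")
    link_words = sum(len(" ".join(a.itertext()).split()) for a in links)
    total_words = len(" ".join(block.itertext()).split())
    density = link_words / total_words if total_words else 0.0
    return len(links), density


page = ET.fromstring(
    "<body>"
    '<div id="menu"><ul>'
    '<li><a href="/">Home</a></li>'
    '<li><a href="/about">About us</a></li>'
    "</ul></div>"
    '<div id="post"><p>A long post about mushrooms with <a href="/fungus">one link</a></p></div>'
    "</body>"
)

densities = {div.get("id"): link_density(div) for div in page.findall(".//div[@id]")}
print(densities)
```

The menu div comes out with every word inside a link, while the post has only a quarter of its words in links, which is the kind of gap the rules above rely on.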