Simple Block segmentation analysis

A recent blog post on Bill's SEO by the Sea about block segmentation got me thinking, but when Tony Hirst wrote a note on his del.icio.us bookmark about the same post I took it as a challenge.

“I wonder how easy it would be to write a script to calculate the link density in a block and use that to identify eg whether it's an internal navigation block, blogroll etc?”

Bill's article covers a pair of research papers by Microsoft. The basic premise is that pages are not always about a single topic but rather groups of topics; by splitting the page into individual blocks and treating these blocks separately, the Microsoft team could do whatever a search engine does.

Their approach was a combination of techniques, but in a nutshell they first identified larger areas of the page through the DOM and then used a technique called VIPS (Vision-based Page Segmentation) to work out visual divides in content. This is rather clever but also a little over the top. In the first paper they describe their reasons for not going entirely DOM-based, citing some of its limitations.

“First, DOM is still a linear structure, so visually adjacent blocks may be far from each other in the structure and departed wrongly. Secondly, tags such as <TABLE> and <P> are used not only for content presentation but also for layout structuring. It is therefore difficult to obtain the appropriate segmentation granularity. Thirdly, in many cases DOM prefers more on presentation to content and therefore not accurate enough to discriminate different semantic blocks in a web page.”

While all of these are true, one could certainly argue that their first objection is a moot point, simply because many web pages render differently in different browsers, and indeed some (including Google's own pages) use image substitution techniques or similar to hide contextual information behind fancy presentation. Indeed, with very little thought both the second and third limitations can easily be overcome. While I am not suggesting that relying on the DOM as your sole source will provide results as good as the papers' combined method, I do not think it would be wildly different. Where the DOM method does come into its own is in processing and coding time, making it far more practical.

Using the DOM to identify blocks

Let's start by making a couple of assumptions:

  • The sites we are evaluating are in English
  • Tables are being used only to store tabular data

Assumption 2 is simply there to reduce our workload; it also means that sites will most likely be following pseudo-semantic layouts.

Identifying containers

If we presume a block of content is held in a container, then we should have an easier time locating and separating blocks; this is very much the Russian doll idea. We start with a global <html>, which we will take as a given. Next we have <head> and <body>; for this example we shall discard the head and its sub-containers. Moving on, inside the body we have a range of potential containers, the most common of which are: div, span, ul, ol, table, dl. Note I have not included headings in this list. Of course each of these containers could be there for structure, semantics or presentation.

Because of the ambiguity of the P tag and its common use in simple presentation it does not make sense to include it as a container block even though it is obviously one.

Our first step therefore is to identify all potential container tags, most of which will be nested inside other tags. Our next step is to eliminate tags which contain only tags identical to themselves, and tags which contain no other element or content. For example:

<div id="test">
  <div id="1"></div>
  <div id="2">mushrooms</div>
  <div id="3"><em>mushrooms</em></div>
</div>

This would result in test and 1 being removed: 1 is empty, and test contains nothing but other divs.
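A minimal sketch of this elimination rule in Python, assuming the markup is well-formed enough for the standard library's xml.etree.ElementTree to parse (real-world HTML would need a more forgiving parser). The function names and the CONTAINERS set are hypothetical, chosen only for illustration:

```python
import xml.etree.ElementTree as ET

# Container tags from the list above; the name CONTAINERS is an assumption.
CONTAINERS = {"div", "span", "ul", "ol", "table", "dl"}

def has_own_text(el):
    """True if the element holds text directly, rather than inside a child."""
    if el.text and el.text.strip():
        return True
    return any(child.tail and child.tail.strip() for child in el)

def is_eliminated(el):
    """Drop empty containers and containers holding only same-named tags."""
    if has_own_text(el):
        return False
    children = list(el)
    if not children:
        return True  # no other element and no content
    return all(child.tag == el.tag for child in children)

def candidate_blocks(root):
    """Return the containers that survive the elimination rule, in document order."""
    return [el for el in root.iter()
            if el.tag in CONTAINERS and not is_eliminated(el)]
```

Run against the example above, only the divs with IDs 2 and 3 should survive.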

Next, a leap of semantic faith: for every heading tag that is directly followed by a container or text, extract the contents up to the point where the container closes or, in the case of text outside a container, up to the next container or heading, whichever comes first. This does create a small problem, for example:

<h1>mysite</h1>
<div id="test">
  <div id="1"></div>
  <div id="2">mushrooms</div>
  <div id="3">
    <h2>mushrooms</h2>
    this is a mushroom
    <h3>mushroom breeds</h3>
    <p>A mushroom is a type of</p><p>fungus</p>
  </div>
</div>

The problem here is that under rule two the separation of the word fungus from the rest of the section means it is not counted as part of the mushroom breeds block. We can take solace in the fact that this would almost certainly have been done for presentation reasons, so the separation was likely to occur with VIPS as well.

A third rule is needed to deal with non-semantic layouts. A common and unhelpful technique is to include heading tags on their own inside a div; in such cases we should treat the encapsulating div as non-existent, meaning its parent div, if previously affected by rule 1, would no longer be so affected.
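The heading rule can be sketched along the following lines. This is a simplification under stated assumptions: it uses Python's xml.etree.ElementTree (so the markup must be well-formed), it treats any non-heading element that follows a heading as the container, and loose text is cut at the first element that follows it. The name heading_blocks is hypothetical.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

def flat_text(el):
    """All text inside an element, with whitespace collapsed."""
    return " ".join(" ".join(el.itertext()).split())

def heading_blocks(root):
    """Return (heading text, block text) pairs found under root."""
    blocks = []
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag not in HEADINGS:
                continue
            title = flat_text(child)
            loose_text = (child.tail or "").strip()
            if loose_text:
                # Text outside a container: it runs until the next element.
                blocks.append((title, loose_text))
            elif i + 1 < len(children) and children[i + 1].tag not in HEADINGS:
                # An element directly after the heading: take it up to its close.
                blocks.append((title, flat_text(children[i + 1])))
    return blocks
```

On the example above (wrapped in a single root element so it parses), the "mushroom breeds" block ends at the first closing p tag, so the word fungus is left out, which is exactly the problem described.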

<div id="test">
  <div id="1"><h2>Mushroom species</h2></div>
  <div id="2">mushrooms come in all shapes and sizes</div>
</div>

This would result in the div with ID test being a block, as ID 1 contains only a heading tag and no further content.
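One way to treat the wrapping div as non-existent is to lift the heading out of it before applying the other rules. The sketch below assumes well-formed markup parsed with Python's xml.etree.ElementTree; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

def is_heading_wrapper(el):
    """True for a div whose only content is a single heading tag."""
    if el.tag != "div" or len(el) != 1 or el[0].tag not in HEADINGS:
        return False
    stray_text = (el.text, el[0].tail)
    return not any(t and t.strip() for t in stray_text)

def unwrap_heading_wrappers(root):
    """Replace each heading-only div with the heading it contains."""
    # Collect the targets first so the tree is not modified while iterating.
    targets = [(parent, child)
               for parent in root.iter()
               for child in parent
               if is_heading_wrapper(child)]
    for parent, wrapper in targets:
        heading = wrapper[0]
        heading.tail = wrapper.tail
        position = list(parent).index(wrapper)
        parent.remove(wrapper)
        parent.insert(position, heading)
    return root
```

After unwrapping, the div with ID test contains an h2 followed by a div, so it no longer holds only identical tags and rule 1 leaves it alone.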

So let’s take a fairly common layout and see if with just these rules we can break our content down. To make things fair we will also do a visual breakdown to see if they match. Our site of choice is this site's front page, full of content and with a fairly complex layout:
[Figure: visual layout]

[Figure: DOM layout]

So from the DOM and applying the above rules we are left with…

  • Div – Header
  • Div – Menu (contains purely UL no ID)
  • Div – title (images only)

Then for each blog post entry

  • Div – Post-ID#

Beyond the blog entry divs

  • Div - Sidebar left
  • UL – No ID
  • H3 - What is IA
  • H3 - What is SEO
  • H3 - Who is Tim Nash

And so on. One nice thing to note: the hCard microformat works nicely in this scenario.

The rest again follow the rules nicely, with the exception of the blip inside the bottom-right div, which has two heading tags one immediately after the other; this would generate two blocks rather than one.

Analysing for number of links

Our block analysis is not overly complicated and is not 100% accurate, but it doesn’t have to be. Let’s go back to Tony's initial idea: while once again you could use the papers' link analysis, I think a simple scoring system would work with comparable results.

Determine link count

This is pretty simple: count the number of links in a block. To determine whether the block is a “link block” we could take several approaches.

The “I say it is” approach

In this approach, once a block contains n links it is deemed to contain enough links to be a link block, regardless of other content.

Content to link ratio

If the number of words inside ‘a href’ tags (<a href>the number here</a>) exceeds the number of words outside them (or some other ratio), the block is considered to be a link block.
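Both this ratio and the fixed-threshold approach above could be sketched as follows. This is a rough illustration, assuming well-formed markup parsed with Python's xml.etree.ElementTree and a naive whitespace-based word count; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def word_count(el):
    """Naive word count: whitespace-separated tokens in the element's text."""
    return len(" ".join(el.itertext()).split())

def link_word_stats(block):
    """Return (number of links, words inside links, words outside links)."""
    links = [a for a in block.iter("a") if a.get("href")]
    inside = sum(word_count(a) for a in links)
    return len(links), inside, word_count(block) - inside

def is_link_block_by_count(block, n):
    """Fixed threshold: n or more links makes a link block."""
    return link_word_stats(block)[0] >= n

def is_link_block_by_ratio(block):
    """Ratio test: more words inside links than outside them."""
    _, inside, outside = link_word_stats(block)
    return inside > outside
```

Note that a stray token such as a dash counts as a word here; a stricter word count would shift the ratio slightly.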

List anchor

If the block contains list elements which contain nothing other than an ‘a href’ tag, or which have a high link-to-content ratio, then it is considered a link block.

Combination approach

I prefer this approach myself. If a block contains more than 3 links, test to see if:

  • Any li tag contains word content beyond its ‘a href’
  • The ‘a href’ content of any li tag is less than half its total word count
  • The total ‘a href’ content count is less than a third of the block

Let’s take a simple scenario:

<ul>
  <li><a href="http://ventureskills.wordpress.com/2007/09/19/stumbleupon-mathematics-for-stumblers/">Mathematics for stumblers</a> - Venture Skills Blog</li>
  <li><a href="http://internetducttape.com/2007/04/10/an-introduction-to-reputation-management/">Reputation management intro</a> - Engtech</li>
  <li><a href="http://seocog.blogspot.com/2007/09/psst-want-to-buy-link.html">Psst Want to buy a link</a> - SEOCog</li>
</ul>

So in this block the number of links is 3, and each ‘a href’ sits within an li tag alongside additional content; however, the total word count inside the ‘a href’ tags exceeds the external content by more than a third.

Consequently this would be flagged as a link box.
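A hedged sketch of the combination approach follows. How the three tests combine into a final verdict is not spelled out above, so this sketch assumes the last test decides: a block with enough links is flagged unless its link text makes up less than a third of its words. It also assumes a minimum of 3 links, since the worked example has exactly 3. Parsing uses Python's xml.etree.ElementTree, and all names are hypothetical.

```python
import xml.etree.ElementTree as ET

def words(el):
    """Naive word count: whitespace-separated tokens in the element's text."""
    return len(" ".join(el.itertext()).split())

def link_words(el):
    """Words that sit inside 'a href' tags within the element."""
    return sum(words(a) for a in el.iter("a") if a.get("href"))

def combination_tests(block, min_links=3):
    """Run the three tests; return None if the block has too few links."""
    n_links = sum(1 for a in block.iter("a") if a.get("href"))
    if n_links < min_links:
        return None
    items = list(block.iter("li"))
    return {
        "li_has_extra_words": any(words(li) > link_words(li) for li in items),
        "li_links_under_half": any(link_words(li) * 2 < words(li) for li in items),
        "links_under_third": link_words(block) * 3 < words(block),
    }

def is_link_block(block, min_links=3):
    """Assumed verdict: enough links, and link text is a third or more of the block."""
    tests = combination_tests(block, min_links)
    return tests is not None and not tests["links_under_third"]
```

With this word count the example list holds 12 words inside links out of 20 in total, so under these assumptions it is flagged as a link block.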

Determining if it’s a blog roll

OK, we could make a huge cheat here and first see whether any link block has a heading containing “Blogroll” or similar wording, such as advertising. Alternatively we can analyse the links. Obviously a blog roll is made up primarily of links to other people's sites, so we would first need to determine whether the link block contains domains other than our own. 50% would be a reasonable limit, so if you have 4 links within a link block, at least 2 must be to foreign domains. Next we would wish to exclude any links which have the attribute rel="tag", as this would indicate a service like Technorati; similarly rel="license" is a good indicator that the link is not from a blog roll. We can also use the reverse: rel="friend" gives us a high probability of that link being part of a blog roll.
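The foreign-domain test might look something like this in Python. It is a sketch only: it compares hostnames exactly (so www.example.com and example.com would count as different domains), treats relative links as internal, and the names and the example domains in any usage are hypothetical.

```python
from urllib.parse import urlparse

def foreign_share(hrefs, own_domain):
    """Fraction of links whose hostname differs from our own domain."""
    hosts = [urlparse(href).hostname for href in hrefs]
    if not hosts:
        return 0.0
    # A missing hostname means a relative link, which points at our own site.
    foreign = [host for host in hosts if host and host != own_domain]
    return len(foreign) / len(hosts)

def passes_foreign_test(hrefs, own_domain, limit=0.5):
    """True when at least `limit` of the links point to foreign domains."""
    return foreign_share(hrefs, own_domain) >= limit
```

With 4 links of which 2 are foreign, the share is exactly 0.5 and the block passes; with only 1 foreign link it does not.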

Basic Blog Roll scoring

So let’s create a simple scoring system for block links

  • For each new domain: +2
  • For each link that contains rel="friend": +2
  • For each link to your own domain: -1
  • For each link to an external domain containing rel="tag" or rel="license": -2

Setting the blog roll likelihood threshold at 5

So let’s take a look at my original link block:

3 domains, none marked with a rel attribute, for a total of 6, making it likely to be a blog roll.
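The scoring can be sketched in a few lines of Python. Some details are assumptions on my part rather than part of the rules above: relative links count as links to your own domain, hostnames are compared exactly, and a score equal to the threshold counts as likely. The function names and the mysite.example domain are hypothetical.

```python
from urllib.parse import urlparse

def blogroll_score(links, own_domain):
    """Score a link block; links is a list of (href, rel) pairs, rel may be ''."""
    score = 0
    seen_domains = set()
    for href, rel in links:
        host = urlparse(href).hostname
        rels = set(rel.lower().split())
        if "friend" in rels:
            score += 2
        if host is None or host == own_domain:
            score -= 1  # link to your own domain (relative links included)
            continue
        if host not in seen_domains:
            seen_domains.add(host)
            score += 2  # a new external domain
        if rels & {"tag", "license"}:
            score -= 2
    return score

def is_likely_blogroll(links, own_domain, threshold=5):
    """Assumes a score at or above the threshold counts as likely."""
    return blogroll_score(links, own_domain) >= threshold
```

Fed the three links from the original link block with empty rel values, this gives 3 x 2 = 6, matching the total above.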

How to actually program it?

I have tried to keep all the rules fairly simple, so a would-be programmer could easily (hmm) knock this up with a little wizardry and some basic knowledge of XPath. Indeed the whole program would consist of just three elements: an XPath implementation, a word count and a string expander. Simple really :)
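To give a flavour of the XPath and word count parts, here is a small sketch using the limited XPath support in Python's xml.etree.ElementTree. It assumes well-formed markup, and the function names are hypothetical; a full XPath engine and a forgiving HTML parser would be needed for real pages.

```python
import xml.etree.ElementTree as ET

CONTAINER_TAGS = ("div", "span", "ul", "ol", "table", "dl")

def containers(root):
    """Find every potential container below root, one tag type at a time."""
    found = []
    for tag in CONTAINER_TAGS:
        found.extend(root.findall(f".//{tag}"))
    return found

def link_count(block):
    """Count the 'a' tags carrying an href anywhere inside the block."""
    return len(block.findall(".//a[@href]"))

def text_word_count(block):
    """Naive word count over all text inside the block."""
    return len(" ".join(block.itertext()).split())
```

Looping over containers(root) and printing link_count and text_word_count for each gives the raw numbers the rules above work from.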




Potential uses

The first and obvious use is to analyse pages you are hoping to gain a link from, to maximise your link's potential. Indeed, really smart so-and-sos might even go as far as to develop this into a tool to create a much more accurate pricing system for links, similar to that of Text Link Ads. Of course they could also use it to identify blog rolls ;) What uses can you think of for such a script?


13 Comments »

Comment by Daryl Quenet from SEO Canada
2008-05-13 00:03:33

Great post Tim. Why do I have a feeling if Microsoft implemented functionality like this they would make their search engine go from unreliable to utterly dysfunctional…

Comment by Tim Nash
2008-05-13 07:20:46

You cynic!
There is nothing wrong with their idea, and it should be pointed out I don’t think this is anything that would be added to regular searches any time soon.

The principle is sound; their implementation was impressive, just impractical for the majority of people to replicate. KISS does not appear to be in their vocabulary.

 
 
Comment by Tony Hirst from OUseful.info
2008-05-14 20:04:42

Hi Tim
Interesting post at first skim - i now need to read it properly :-)

When i posted the bookmark, my thought was to use something like link density to try and guess at two sorts of block based on link density, where link density is measured really simply as something like (number of a nodes in a subtree of the DOM)/(number of non-text nodes in the subtree), or even just (number of alpha chars of link text in subtree)/(number of alpha chars in subtree)

1) intrasite navigation blocks (high link density and high proportion of those links pointing to the local domain)
2) blogrolls (cribbed from a ‘blogroll’ title, class or id attribute value ;-) or based on high link density and a heuristic that counts wordpress.com, blogspot.com etc). Alternatively, run autodetection on the links to see if they all return RSS feeds that look like canonical blog feed urls from major platforms, or feedburner feeds…

You’ve got me really intrigued about pursuing this a bit further now ;-)

Comment by Tim Nash
2008-05-14 21:06:21

Hi Tony, when I wrote this post I was sort of throwing ideas onto the virtual blackboard. Since then I have realised my rules were a little on the too-simplistic side, but with a couple of tweaks it is easy to separate blocks for comparison. Once you have separated the blocks, checking if links are predominantly internal or external will allow you to make a reasonable guess as to the block's intended use.

For example, today I applied this technique in a rudimentary paid link detector, looking for links in pages which did not fit in with the content of the block or that of the overall site. It wasn’t hugely accurate but certainly accurate enough to see the use in the technique.

 
 
Comment by Tony Hirst from OUseful.info
2008-05-14 21:01:15

Just taken the dog for a walk, and these are the thoughts I brought back:

1) for sites on eg wordpress and blogspot, the templates are well known and widely used, so case based reasoning will identify many of the blocks based on an understanding of the popular templates, if you can sniff them out;

2) when calculating link density, you can probably strip out em, i, b, strong etc tags (leaving their contents) to simplify the dom;

3) for identifying the major content block in a page, eg the news story in a news page, then the block with some increasing function of lowest link density and longest text character count is possibly the one you want; there probably needs to be some minmax reasoning going on here - eg i may want the largest amount of text content at as low a level in the dom hierarchy as possible (ie i want the content div not the main=content+banner block)

4) stuff like adsense etc can be sniffed and maybe removed from the dom to simplify processing;

5) where an internal navigation block is identified, I wonder whether you can start to look for site based templates (which is i guess what dapper tries to do when you give it several related pages) and then simplify the analysis on this basis - eg building a weighted page that maybe gives ‘density confidence’ values (or mean+sd density scores) of common blocks across linked pages from the same site, maybe at the same depth in the URL path?

I love a good dog walk ;-)

Comment by Tim Nash
2008-05-14 21:19:09

I think, if it was taken to its natural conclusion: most websites contain common semantic elements (they might not know it), so with a combination of DOM interrogation based on preconceived templates, i.e. expecting to find internal links grouped inside a container (menu system), a group of external links (blog roll), and a container with multiple smaller containers primarily consisting of text (main page content), you could decrease the processing time.

Likewise, as I mentioned in the article, it would in many cases be simpler, once blocks have been defined, to rely on language for accurate answers. Internal navigation will often be referred to as a menu, either directly in text or as an ID; likewise the easiest way to identify a blog roll is to pull from the DOM the group of links that come after the term “Blog Roll” or “Friends Links” or some other derivation. Of course this is presuming the site you are looking at is indeed in the language you are using; I’m not familiar with the Chinese for “Blog roll” and am unlikely to write a program to look for it.

 
 
Comment by Dan Thies from SEO Fast Start
2008-05-15 19:28:13

Welcome to my blogroll, Tim.

Nice. Freakin. Post.

 
Comment by Martin Bowling from Martin Bowling
2008-05-17 03:13:03

Wow this is an awesome post, I hadn’t stopped by your blog for a while cause I have been super busy but I stopped back in for some great stuff. I was just looking around for some info like this. As always Tim, top notch stuff! Thanks man.

 
Comment by Gab Goldenberg from SEO ROI Services
2008-05-22 05:33:36

Thanks for the explanation Tim!

BTW, I had a thought. Since some themes and site templates are really common, perhaps the engines could use what they know about ‘footprinting templates to ID spam’ in order to reduce processing required for many sites. The script would query a DB of known footprints and see if it’s a theme whose blocks are already identified; if not, it would start from scratch.

What do you think?

 
