<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Simple Block segmentation analysis</title>
	<atom:link href="http://www.timnash.co.uk/05/2008/block-segmentation-analysis/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.timnash.co.uk/05/2008/block-segmentation-analysis/</link>
	<description>The SEO Consultant</description>
	<pubDate>Wed, 20 Aug 2008 11:32:12 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6</generator>
		<item>
		<title>By: Quality Score, Relevance Score and Search Engine Optimization &#124; SEO Design Solutions</title>
		<link>http://www.timnash.co.uk/05/2008/block-segmentation-analysis/#comment-1223</link>
		<dc:creator>Quality Score, Relevance Score and Search Engine Optimization &#124; SEO Design Solutions</dc:creator>
		<pubDate>Mon, 09 Jun 2008 14:03:17 +0000</pubDate>
		<guid isPermaLink="false">http://www.timnash.co.uk/?p=137#comment-1223</guid>
		<description>[...] algorithms) in the search index through evolved Phrase Based Indexing and Retrieval (paIR), block segment analysis, vector analysis and dozens of other criteria used to assess your content, site structure and [...]</description>
		<content:encoded><![CDATA[<p>[...] algorithms) in the search index through evolved Phrase Based Indexing and Retrieval (paIR), block segment analysis, vector analysis and dozens of other criteria used to assess your content, site structure and [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Link Worth – What&#8217;s yours worth &#8226; Tim Nash UK SEO Blog</title>
		<link>http://www.timnash.co.uk/05/2008/block-segmentation-analysis/#comment-1110</link>
		<dc:creator>Link Worth – What&#8217;s yours worth &#8226; Tim Nash UK SEO Blog</dc:creator>
		<pubDate>Fri, 30 May 2008 15:33:39 +0000</pubDate>
		<guid isPermaLink="false">http://www.timnash.co.uk/?p=137#comment-1110</guid>
		<description>[...] Link Worth – What&#8217;s yours worthGetting Excited about Mashed2008!Clickthrough Experiment UpdateNear perfect sales pitch by Bob Massa&#8230;Simple Block segmentation analysis [...]</description>
		<content:encoded><![CDATA[<p>[...] Link Worth – What&#8217;s yours worthGetting Excited about Mashed2008!Clickthrough Experiment UpdateNear perfect sales pitch by Bob Massa&#8230;Simple Block segmentation analysis [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Gab Goldenberg</title>
		<link>http://www.timnash.co.uk/05/2008/block-segmentation-analysis/#comment-1061</link>
		<dc:creator>Gab Goldenberg</dc:creator>
		<pubDate>Thu, 22 May 2008 04:33:36 +0000</pubDate>
		<guid isPermaLink="false">http://www.timnash.co.uk/?p=137#comment-1061</guid>
		<description>Thanks for the explanation Tim!

BTW, I had a thought. Since some themes and site templates are really common, perhaps the engines could use what they know about 'footprinting templates to ID spam' in order to reduce the processing required for many sites. The script would query a DB of known footprints and see if it's a theme whose blocks are already identified; if not, it would start from scratch.

What do you think?</description>
		<content:encoded><![CDATA[<p>Thanks for the explanation Tim!</p>
<p>BTW, I had a thought. Since some themes and site templates are really common, perhaps the engines could use what they know about &#8216;footprinting templates to ID spam&#8217; in order to reduce the processing required for many sites. The script would query a DB of known footprints and see if it&#8217;s a theme whose blocks are already identified; if not, it would start from scratch.</p>
<p>What do you think?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Why The Repetitive SEO Post Can Still Be Valuable &#124; Cape Cod SEO</title>
		<link>http://www.timnash.co.uk/05/2008/block-segmentation-analysis/#comment-1057</link>
		<dc:creator>Why The Repetitive SEO Post Can Still Be Valuable &#124; Cape Cod SEO</dc:creator>
		<pubDate>Wed, 21 May 2008 12:02:31 +0000</pubDate>
		<guid isPermaLink="false">http://www.timnash.co.uk/?p=137#comment-1057</guid>
		<description>[...] may find of value. I have to admit that I found a good deal of them interesting and presenting new information or unique perspectives. But many of them talk about SEO topics I was at least aware of and probably [...]</description>
		<content:encoded><![CDATA[<p>[...] may find of value. I have to admit that I found a good deal of them interesting and presenting new information or unique perspectives. But many of them talk about SEO topics I was at least aware of and probably [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Martin Bowling</title>
		<link>http://www.timnash.co.uk/05/2008/block-segmentation-analysis/#comment-1024</link>
		<dc:creator>Martin Bowling</dc:creator>
		<pubDate>Sat, 17 May 2008 02:13:03 +0000</pubDate>
		<guid isPermaLink="false">http://www.timnash.co.uk/?p=137#comment-1024</guid>
		<description>Wow, this is an awesome post. I hadn't stopped by your blog for a while because I have been super busy, but I stopped back in for some great stuff. I was just looking around for some info like this. As always, Tim, top-notch stuff! Thanks man.</description>
		<content:encoded><![CDATA[<p>Wow, this is an awesome post. I hadn&#8217;t stopped by your blog for a while because I have been super busy, but I stopped back in for some great stuff. I was just looking around for some info like this. As always, Tim, top-notch stuff! Thanks man.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dan Thies</title>
		<link>http://www.timnash.co.uk/05/2008/block-segmentation-analysis/#comment-1005</link>
		<dc:creator>Dan Thies</dc:creator>
		<pubDate>Thu, 15 May 2008 18:28:13 +0000</pubDate>
		<guid isPermaLink="false">http://www.timnash.co.uk/?p=137#comment-1005</guid>
		<description>Welcome to my blogroll, Tim.

Nice. Freakin. Post.</description>
		<content:encoded><![CDATA[<p>Welcome to my blogroll, Tim.</p>
<p>Nice. Freakin. Post.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tim Nash</title>
		<link>http://www.timnash.co.uk/05/2008/block-segmentation-analysis/#comment-998</link>
		<dc:creator>Tim Nash</dc:creator>
		<pubDate>Wed, 14 May 2008 20:19:09 +0000</pubDate>
		<guid isPermaLink="false">http://www.timnash.co.uk/?p=137#comment-998</guid>
		<description>I think that, taken to its natural conclusion, this could decrease the processing time: most websites contain common semantic elements (they might not know it), so you could combine DOM interrogation with preconceived templates, i.e. expecting to find internal links grouped inside a container (menu system), a group of external links (blog roll), and a container with multiple smaller containers primarily consisting of text (main page content).

Likewise, as I mentioned in the article, once blocks have been defined it would in many cases be simpler to rely on language for accurate answers. Internal navigation will often be referred to as a menu, either directly in text or as an ID. Similarly, the easiest way to identify a blog roll is to pull from the DOM the group of links that come after the term "Blog Roll" or "Friends Links" or some other derivation. Of course, this presumes the site you are looking at is in the language you are using; I'm not familiar with the Chinese for "Blog roll" and am unlikely to write a program to look for it.</description>
		<content:encoded><![CDATA[<p>I think that, taken to its natural conclusion, this could decrease the processing time: most websites contain common semantic elements (they might not know it), so you could combine DOM interrogation with preconceived templates, i.e. expecting to find internal links grouped inside a container (menu system), a group of external links (blog roll), and a container with multiple smaller containers primarily consisting of text (main page content).</p>
<p>Likewise, as I mentioned in the article, once blocks have been defined it would in many cases be simpler to rely on language for accurate answers. Internal navigation will often be referred to as a menu, either directly in text or as an ID. Similarly, the easiest way to identify a blog roll is to pull from the DOM the group of links that come after the term &#8220;Blog Roll&#8221; or &#8220;Friends Links&#8221; or some other derivation. Of course, this presumes the site you are looking at is in the language you are using; I&#8217;m not familiar with the Chinese for &#8220;Blog roll&#8221; and am unlikely to write a program to look for it.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tim Nash</title>
		<link>http://www.timnash.co.uk/05/2008/block-segmentation-analysis/#comment-997</link>
		<dc:creator>Tim Nash</dc:creator>
		<pubDate>Wed, 14 May 2008 20:06:21 +0000</pubDate>
		<guid isPermaLink="false">http://www.timnash.co.uk/?p=137#comment-997</guid>
		<description>Hi Tony, when I wrote this post I was sort of throwing ideas onto the virtual blackboard. Since then I have realised my rules were a little on the too simplistic side, but with a couple of tweaks it is easy to separate blocks for comparison. Once you have separated the blocks, checking whether a block's links are predominantly internal or external will allow you to make a reasonable guess as to its intended use.
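
A rough Python sketch of the internal-versus-external check, using only the standard library. The function names, the 0.8 threshold and the labels are illustrative assumptions, not the rules from the post, and hosts are compared exactly, so www.example.com and example.com would count as different sites:

```python
from urllib.parse import urljoin, urlparse


def internal_link_share(hrefs, page_url):
    """Return the fraction of links that stay on the page's own host."""
    site_host = urlparse(page_url).netloc.lower()
    hosts = []
    for href in hrefs:
        # Resolve relative links against the page URL before comparing hosts.
        parsed = urlparse(urljoin(page_url, href))
        # Skip mailto:, javascript: and other non-web schemes.
        if parsed.scheme in ("http", "https"):
            hosts.append(parsed.netloc.lower())
    if not hosts:
        return None
    internal = sum(1 for host in hosts if host == site_host)
    return internal / len(hosts)


def guess_block_use(hrefs, page_url, threshold=0.8):
    """Rough guess at a block's purpose from where its links point."""
    share = internal_link_share(hrefs, page_url)
    if share is None:
        return "no links"
    if share >= threshold:
        return "mostly internal (navigation?)"
    if 1.0 - share >= threshold:
        return "mostly external (blogroll?)"
    return "mixed"
```

Feeding it the href values from one block along with the page URL gives a label that is only a first guess, to be checked against the block's text.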

For example, today I applied this technique in a rudimentary paid link detector, looking for links in pages which did not fit in with the content of the block or that of the overall site. It wasn't hugely accurate, but certainly accurate enough to see the use in the technique.</description>
		<content:encoded><![CDATA[<p>Hi Tony, when I wrote this post I was sort of throwing ideas onto the virtual blackboard. Since then I have realised my rules were a little on the too simplistic side, but with a couple of tweaks it is easy to separate blocks for comparison. Once you have separated the blocks, checking whether a block&#8217;s links are predominantly internal or external will allow you to make a reasonable guess as to its intended use.</p>
<p>For example, today I applied this technique in a rudimentary paid link detector, looking for links in pages which did not fit in with the content of the block or that of the overall site. It wasn&#8217;t hugely accurate, but certainly accurate enough to see the use in the technique.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tony Hirst</title>
		<link>http://www.timnash.co.uk/05/2008/block-segmentation-analysis/#comment-996</link>
		<dc:creator>Tony Hirst</dc:creator>
		<pubDate>Wed, 14 May 2008 20:01:15 +0000</pubDate>
		<guid isPermaLink="false">http://www.timnash.co.uk/?p=137#comment-996</guid>
		<description>Just taken the dog for a walk, and these are the thoughts I brought back:

1) for sites on eg wordpress and blogspot, the templates are well known and widely used, so case based reasoning will identify many of the blocks based on an understanding of the popular templates, if you can sniff them out;

2) when calculating link density, you can probably strip out em, i, b, strong etc tags (leaving their contents) to simplify the dom;

3) for identifying the major content block in a page, eg the news story in a news page, then the block with some increasing function of lowest link density and longest text character count is possibly the one you want; there probably needs to be some minmax reasoning going on here - eg i may want the largest amount of text content at as low a level in the dom hierarchy as possible (ie i want the content div not the main=content+banner block)

4) stuff like adsense etc can be sniffed and maybe removed from the dom to simplify processing;

5) where an internal navigation block is identified, I wonder whether you can start to look for site based templates (which is i guess what dapper tries to do when you give it several related pages) and then simplify the analysis on this basis - eg building a weighted page that maybe gives 'density confidence' values (or mean+sd density scores) of common blocks across linked pages from the same site, maybe at the same depth in the URL path?
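
Point 3 could be sketched roughly like this in Python. The Block fields, the scoring function (text characters discounted by link density) and the 0.8 keep_fraction are assumptions picked for illustration; the counts are expected to come from an earlier pass over the DOM:

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str        # hypothetical label, e.g. an id or class value
    depth: int       # depth of the block's root node in the DOM
    text_chars: int  # alphabetic characters of text in the block
    link_chars: int  # alphabetic characters that sit inside links


def score(block):
    """Hypothetical score: text length discounted by link density."""
    if block.text_chars == 0:
        return 0.0
    density = block.link_chars / block.text_chars
    return block.text_chars * (1.0 - density)


def pick_main_block(blocks, keep_fraction=0.8):
    """Pick the deepest block scoring at least keep_fraction of the best."""
    if not blocks:
        return None
    best = max(score(b) for b in blocks)
    candidates = [b for b in blocks if score(b) >= keep_fraction * best]
    # Prefer the lowest level in the DOM, then the higher score.
    return max(candidates, key=lambda b: (b.depth, score(b)))
```

When a parent block and its content child score close together, the deeper child wins, which is the 'content div, not main' preference from point 3.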

I love a good dog walk ;-)</description>
		<content:encoded><![CDATA[<p>Just taken the dog for a walk, and these are the thoughts I brought back:</p>
<p>1) for sites on eg wordpress and blogspot, the templates are well known and widely used, so case based reasoning will identify many of the blocks based on an understanding of the popular templates, if you can sniff them out;</p>
<p>2) when calculating link density, you can probably strip out em, i, b, strong etc tags (leaving their contents) to simplify the dom;</p>
<p>3) for identifying the major content block in a page, eg the news story in a news page, then the block with some increasing function of lowest link density and longest text character count is possibly the one you want; there probably needs to be some minmax reasoning going on here - eg i may want the largest amount of text content at as low a level in the dom hierarchy as possible (ie i want the content div not the main=content+banner block)</p>
<p>4) stuff like adsense etc can be sniffed and maybe removed from the dom to simplify processing;</p>
<p>5) where an internal navigation block is identified, I wonder whether you can start to look for site based templates (which is i guess what dapper tries to do when you give it several related pages) and then simplify the analysis on this basis - eg building a weighted page that maybe gives &#8216;density confidence&#8217; values (or mean+sd density scores) of common blocks across linked pages from the same site, maybe at the same depth in the URL path?</p>
<p>I love a good dog walk <img src='http://www.timnash.co.uk/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tony Hirst</title>
		<link>http://www.timnash.co.uk/05/2008/block-segmentation-analysis/#comment-993</link>
		<dc:creator>Tony Hirst</dc:creator>
		<pubDate>Wed, 14 May 2008 19:04:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.timnash.co.uk/?p=137#comment-993</guid>
		<description>Hi Tim
Interesting post at first skim - i now need to read it properly :-)

When i posted the bookmark, my thought was to use something like link density to try and guess at two sorts of block based on link density, where link density is measured really simply as something like (number of a nodes in a subtree of the DOM)/(number of non-text nodes in the subtree), or even just (number of alpha chars of link text in subtree)/(number of alpha chars in subtree)

1) intrasite navigation blocks (high link density and high proportion of those links pointing to the local domain)
2) blogrolls (cribbed from a 'blogroll' title, class or id attribute value ;-) or based on high link density and a heuristic that counts wordpress.com, blogspot.com etc. Alternatively, run autodetection on the links to see if they all return RSS feeds that look like canonical blog feed urls from major platforms, or feedburner feeds...
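
The character-based version of the link density measure above, (alpha chars of link text)/(alpha chars in subtree), can be sketched in Python with the standard-library html.parser. The class and function names are made up for illustration, and the fragment passed in stands for one subtree of the DOM:

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts alphabetic characters inside and outside anchor tags."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0  # how many open anchor tags we are inside
        self.link_chars = 0    # alphabetic characters inside links
        self.total_chars = 0   # alphabetic characters overall

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        alpha = sum(1 for ch in data if ch.isalpha())
        self.total_chars += alpha
        if self.anchor_depth > 0:
            self.link_chars += alpha


def link_density(html_fragment):
    """Return link-text characters divided by all text characters."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()  # flush any buffered text
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars
```

A menu made up only of links comes out at 1.0 and a plain paragraph at 0.0, so the two sorts of block sit at the high end and article text at the low end.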

You've got me really intrigued about pursuing this a bit further now ;-)</description>
		<content:encoded><![CDATA[<p>Hi Tim<br />
Interesting post at first skim - i now need to read it properly <img src='http://www.timnash.co.uk/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>When i posted the bookmark, my thought was to use something like link density to try and guess at two sorts of block based on link density, where link density is measured really simply as something like (number of a nodes in a subtree of the DOM)/(number of non-text nodes in the subtree), or even just (number of alpha chars of link text in subtree)/(number of alpha chars in subtree)</p>
<p>1) intrasite navigation blocks (high link density and high proportion of those links pointing to the local domain)<br />
2) blogrolls (cribbed from a &#8216;blogroll&#8217; title, class or id attribute value <img src='http://www.timnash.co.uk/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> or based on high link density and a heuristic that counts wordpress.com, blogspot.com etc. Alternatively, run autodetection on the links to see if they all return RSS feeds that look like canonical blog feed urls from major platforms, or feedburner feeds&#8230;</p>
<p>You&#8217;ve got me really intrigued about pursuing this a bit further now <img src='http://www.timnash.co.uk/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /></p>
]]></content:encoded>
	</item>
</channel>
</rss>
