Tim Nash "stuff" Blog

Is it porn?


Idle fact of the day: over a third of the web is porn, according to a study by Optenet.

That’s quite a lot, and it’s increasing. So, imagine you ran a popular membership management plugin for WordPress and wanted to segment the adult industry specialists in your mailing list to better deal with their “needs”. Oddly, it’s not something we ask at checkout!

So How Do You Find Out If a Site Has Porn?

Porn burns your eyes

Well, you could use your eyes, I guess, but when you have a couple of thousand URLs to process it’s going to take some time. You would also need to visit future sites, so viewing each site is not really feasible. Instead, the answer is clearly NLP and brute force checking. Don’t look at me like that, that’s clearly the answer!

Keyword Extraction
There are 3 broad types of keyword extraction analysis:

  • Statistical
  • Linguistic
  • Mixed

Statistical – Instead of looking at the words themselves, this method looks at aspects like frequency and the position of the word in the text. It is probably the most common method of analysis, in part because it is easy to program and gives relatively good results.

Linguistic – Looking at the actual words, the structure, and the parts-of-speech. The benefit of using such methods is a greater accuracy in identifying individual words and collections, but not necessarily any major advantage in terms of results. Indeed, linguistic methods are rarely deployed on their own.

Mixed – This method uses a combination of statistical and linguistic methods to identify a collection of words. It has a higher degree of accuracy than purely statistical methods, and it can be combined with term frequency and inverse document frequency to produce reasonable results.

For most of my projects, I rely on pure statistical methods for extraction, combined with a stop list to prevent ‘this’, ‘the’, and ‘a’ turning up as principal keywords. However, for this project, I chose to cheat.
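For reference, the non-cheating route looks roughly like the sketch below: count word frequencies and drop anything on a stop list. This is a minimal illustration, not the code behind this post; the stop list and the function name `top_keywords` are made up for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real one would be far longer.
STOP_WORDS = {"a", "an", "and", "the", "this", "is", "of", "to", "in", "it"}


def top_keywords(text, limit=5):
    """Return the most frequent non-stop words as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(limit)


sample = "This is the search engine post. The search engine is a ranking engine."
print(top_keywords(sample, limit=2))
```

Position in the text and inverse document frequency could be layered on top of the raw counts, but plain frequency plus a stop list is the core of it.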

AlchemyAPI

AlchemyAPI describes itself as a suite of content analysis and meta-data annotation tools. It’s a pure REST-based system that is quite cool and free, even for commercial use. One of the methods it has is Keyword / Terminology Extraction, which it describes as ‘sophisticated statistical algorithms and natural language processing technology’. So, using mixed methodology on a basic level, we can assume some grouping of words, but the main part of the API is still using term frequency combined with a stop word list. You will notice AlchemyAPI produces two-word terms more often than single words.

Keyword / Term Extraction: Web API

To use the API, you need an API key, and that’s about it. A simple POST-based curl request to the following endpoint is all that’s needed:

http://access.alchemyapi.com/calls/url/URLGetRankedKeywords

  • apikey
  • url
  • outputMode

Be careful to capitalise the M in outputMode, otherwise you get XML returned regardless.
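As a rough sketch, the same request can be assembled in Python with the standard library. The endpoint and the three parameter names come from above; the `json` value for outputMode, the helper names, and the assumption that the reply is a JSON object with a `keywords` list (as the dump below suggests) are mine.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "http://access.alchemyapi.com/calls/url/URLGetRankedKeywords"


def build_keyword_request(api_key, page_url):
    """Build the POST request; nothing is sent until it is opened."""
    fields = {
        "apikey": api_key,
        "url": page_url,
        "outputMode": "json",  # assumed value; note the capital M
    }
    body = urlencode(fields).encode("ascii")
    return Request(ENDPOINT, data=body, method="POST")


def fetch_keywords(api_key, page_url, timeout=10):
    """Send the request and return the 'keywords' list (assumes a JSON reply)."""
    request = build_keyword_request(api_key, page_url)
    with urlopen(request, timeout=timeout) as reply:
        payload = json.loads(reply.read().decode("utf-8"))
    return payload.get("keywords", [])
```

Splitting the request building from the sending keeps the first half easy to check without touching the network.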

The return is basically a list of keywords found in the text and a “relevance” score. For example, on my blog homepage:


[url] => http://www.timnash.co.uk/
[language] => english
[keywords] => Array
(
[0] => stdClass Object
(
[text] => Continue Reading
[relevance] => 0.943294
)

[1] => stdClass Object
(
[text] => search engine
[relevance] => 0.629734
)

[2] => stdClass Object
(
[text] => Search Engine Optimisation
[relevance] => 0.610243
)

[3] => stdClass Object
(
[text] => twitter spammer
[relevance] => 0.599565
)

[4] => stdClass Object
(
[text] => ranking factor
[relevance] => 0.583604
)

[5] => stdClass Object
(
[text] => Information Architecture
[relevance] => 0.551172
)

Now, as you can see, it’s picked up ‘Continue Reading’ as the primary topic. Oh well, guess I need to do a little optimising on my site. My own name came in 6th with a relevancy of 0.541318! Still, as a simple tool it will fit our purposes nicely.

Building Our Check Word List

Now that we have extracted the terms and associated some sort of scoring with them, it’s time to build our check list. This acts almost like a reverse stop list and becomes the list of terms we are focusing on. We’ll assign each term a rating of how associated with porn it is.

At this point, you could introduce some machine learning into the mixture to dynamically update the otherwise arbitrary value associated with each term. However, in this example, it’s a bit of an overkill.

Once I have the list, it’s a case of going through each topic and checking whether it appears in our check list. For every match, we apply a very simple formula:

porn likelihood = Relevance * pornyness

The total ‘porn likelihood’ for all topics is added up and a final page score is derived. If it exceeds a certain value, the page is considered porn!
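A minimal sketch of that scoring step is below. The check list terms, their weights, and the 0.5 threshold are invented for illustration, and matching on the exact lower-cased term is an assumption; the keyword dictionaries mirror the `text` and `relevance` fields in the API output above.

```python
# Hypothetical check list: term -> "pornyness" weight (values are made up).
CHECK_LIST = {"adult video": 0.9, "webcam": 0.4, "xxx": 1.0}


def page_score(keywords, check_list=CHECK_LIST):
    """Sum relevance * pornyness over every keyword found in the check list."""
    total = 0.0
    for keyword in keywords:
        weight = check_list.get(keyword["text"].lower())
        if weight is not None:
            # float() because the API may hand relevance back as a string.
            total += float(keyword["relevance"]) * weight
    return total


def is_porn(keywords, threshold=0.5):
    """Compare the page score against an arbitrary, illustrative threshold."""
    return page_score(keywords) > threshold


blog_keywords = [
    {"text": "Continue Reading", "relevance": "0.943294"},
    {"text": "search engine", "relevance": "0.629734"},
]
print(page_score(blog_keywords), is_porn(blog_keywords))
```

None of the blog’s terms are in the check list, so the page contributes nothing to the total and stays under the threshold.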

To make things interesting, I initially gave the system an arbitrary starting point for deciding when a page is porn, based on testing against a small set of 10 porn sites and 10 sites like my blog and the BBC home page. The system stores the average score of all porn pages and uses that as a modifier to dynamically adjust the threshold of what is and isn’t porn.
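One way such a self-adjusting threshold could work is sketched below. The exact rule is a guess on my part: here the threshold sits midway between the starting value and a fraction of the running average porn-page score. The class name, the `fraction` parameter, and the midpoint rule are all assumptions rather than a description of the real system.

```python
class AdaptiveThreshold:
    """Threshold that drifts toward a fraction of the average porn-page score."""

    def __init__(self, start, fraction=0.5):
        self.start = start        # arbitrary starting threshold
        self.fraction = fraction  # hypothetical modifier strength
        self.total = 0.0          # sum of scores of pages classed as porn
        self.count = 0            # number of pages classed as porn

    def value(self):
        """Current threshold: the start value until some porn pages are seen."""
        if self.count == 0:
            return self.start
        average = self.total / self.count
        return (self.start + self.fraction * average) / 2

    def classify(self, score):
        """Return True for porn and fold that score into the running average."""
        porn = score > self.value()
        if porn:
            self.total += score
            self.count += 1
        return porn


threshold = AdaptiveThreshold(start=1.0)
print(threshold.classify(3.0), threshold.value())
```

Only pages classed as porn feed the average, so a run of clean pages never drags the threshold around.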

Classification Of Porn

@tnash im proud of you mate….. can it determine size and species?
http://twitter.com/seoidiot/statuses/16926576279

That was what greeted my announcement that my porn bot was having more fun than me. Within reason, though, the topic keywords can be used to broadly identify sites by associating keywords with a category. You can then use the relevance score to determine which category the site is most likely to fit in.
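A small sketch of that idea: map terms to categories, add up the relevance per category, and take the largest. The category names and term map are placeholders I made up, and summing relevance is just one plausible way to pick a winner.

```python
from collections import defaultdict

# Hypothetical keyword -> category map; terms and categories are placeholders.
CATEGORY_TERMS = {
    "webcam": "live",
    "live chat": "live",
    "adult video": "video",
    "adult dvd": "video",
}


def most_likely_category(keywords, category_terms=CATEGORY_TERMS):
    """Return the category with the highest summed relevance, or None."""
    totals = defaultdict(float)
    for keyword in keywords:
        category = category_terms.get(keyword["text"].lower())
        if category is not None:
            totals[category] += float(keyword["relevance"])
    if not totals:
        return None
    return max(totals, key=totals.get)


page = [
    {"text": "webcam", "relevance": 0.6},
    {"text": "adult video", "relevance": 0.5},
    {"text": "Live Chat", "relevance": 0.2},
]
print(most_likely_category(page))
```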

If you really want to, point it to a site with an alternate species. I’m fairly confident it would pick it up and place it in bestiality. One of the interesting problems is when a site discusses porn, such as this post. In such scenarios, our little bot would return a false positive, because it only looks at a single URL instead of the entire site.

The obvious way around this is to use a full crawler. You can use AlchemyAPI for this too, along with a combination of content scraping and XPath. We’ll use it to pull all the cross-links on a page and slowly build a score up over time. If it’s an anti-porn site or a therapist’s site, it will still throw up a false positive, but for a site like this one, it should balance out.
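The crawl-and-balance idea can be sketched with the standard library alone. This version swaps XPath for Python’s built-in HTML parser and averages the per-page scores, both of which are my assumptions; `fetch_html` and `score_url` are hypothetical callables you would supply (for example, a downloader and the keyword scoring step).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_site_links(page_url, html):
    """Return absolute links on the page that stay on the same host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


def site_score(start_url, fetch_html, score_url, max_pages=20):
    """Average per-page scores over a breadth-first walk of one site."""
    queue, seen, scores = [start_url], {start_url}, []
    while queue and len(scores) < max_pages:
        url = queue.pop(0)
        scores.append(score_url(url))
        for link in same_site_links(url, fetch_html(url)):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return sum(scores) / len(scores) if scores else 0.0
```

With an average, one post that merely discusses porn gets diluted by the rest of the site, which is the balancing out described above.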

So, there you go. A simple way to classify large sets of sites to determine if they are porn, without burning the eyes out of your sales team, or destroying the innocence of your intern.
xkcd porn

Not quite sure why they feel the need to check the accuracy of my system though….



1 comment

  • Mark

    Interesting post and process, Tim. I think it was one of the US supreme court justices that said regarding porn “I can’t define it, but I know it when I see it.”
