SEO guys like to analyze the keywords used by their competitors after first extracting the text content (see “Extract text on an html webpage WITHOUT using Regular Expression”). As for me, I need to do the same to feed my experimental search engine, NotFoundOnSearchEngine.

So let’s say I have this list of text:

content: {http://www.davidtemkin.com/mtarchive/000002.html
"The lost art of user interface programming"
{Way back when, in the late '80s and early '90s, user interface programming was it. The industry was focused on making computing useful, accessible, and mainstream, and that required attention to the human-computer interface. Then came the Web...}

http://people.cis.ksu.edu/~schmidt/text/densem.html
"Denotational Semantics: A Methodology for Language Development"
{In 1986, Allyn and Bacon published my Denotational Semantics text, which I wrote while I was a post-doc in Edinburgh in 1982-83. The book sold steadily over the years, but Allyn and Bacon was purchased by William C. Brown, which was purchased by McGraw-Hill. McGraw-Hill deleted the text as soon as they acquired it. }

http://heim.ifi.uio.no/~trygver/themes/mvc/mvc-index.html
"MVC XEROX PARC 1978-79"
{I made the first implementation and wrote the original MVC note at Xerox PARC in 1978. The note defines four terms; Model, View, Controller and Editor . The Editor is an ephemeral component that the View creates on demand as an interface between the View and the input devices such as mouse and keyboard.}
}

from which I need to extract keywords, and only keywords; that is, I want to exclude a list of non-keywords:

exclude-list: ["http" "" "" "www" "com" "^/" "The" "of" "and" "was" "it" "" "on" "that" "to" "the" "Then" "html^/" "for" "In" "my" "which" "I" "I" "was" "a" "over" "years" "but" "by" "C" "soon" "they" "it" "no" "html^/" "I" "the" "at" "The" "such" "that" "is" "an" "^/^/http" "html"]

To do this, you just have to use the parse function with a string of delimiter characters, which splits the content into separate words like this:


raw-list: parse/all content {" #" " : , ' / . #"{" #"}" - ; ~}

which results in this:

["http" "" "" "www" "davidtemkin" "com" "mtarchive" "000002" "html" "^/" "The" "lost" "art" "of" "user" "interface" "programming" "^/" "Way" "back" "when" "" "in" "the" "late" "" "80s" "and" "early" "" "90s" "" "user" "interface" "programming" "was" "it" "" "The" "industry" "was" "focused" "on" "making" "computing" "useful" "" "accessible" "" "and" "mainstream" "" "and" "that" "required" "attention" "to" "the" "human" "computer" "interface" "" "Then" "came" "the" "Web" "" "" "" "^/^/http" "" "" "people" "cis" "ksu" "edu" "" "schmidt" "text" "densem" "html^/" "Denotational" "Semantics" "" "A" "Methodology" "for" "Language" "Development" "^/" "In" "1986" "" "Allyn" "and" "Bacon" "published" "my" "Denotational" "Semantics" "text" "" "which" "I" "wrote" "while" "I" "was" "a" "post" "doc" "in" "Edinburgh" "in" "1982" "83" "" "The" "book" "sold" "steadily" "over" "the" "years" "" "but" "Allyn" "and" "Bacon" "was" "purchased" "by" "William" "C" "" "Brown" "" "which" "was" "purchased" "by" "McGraw" "Hill" "" "McGraw" "Hill" "deleted" "the" "text" "as" "soon" "as" "they" "acquired" "it" "" "" "^/^/http" "" "" "heim" "ifi" "uio" "no" "" "trygver" "themes" "mvc" "mvc" "index" "html^/" "MVC" "XEROX" "PARC" "1978" "79" "^/" "I" "made" "the" "first" "implementation" "and" "wrote" "the" "original" "MVC" "note" "at" "Xerox" "PARC" "in" "1978" "" "The" "note" "defines" "four" "terms" "" "Model" "" "View" "" "Controller" "and" "Editor" "" "" "The" "Editor" "is" "an" "ephemeral" "component" "that" "the" "View" "creates" "on" "demand" "as" "an" "interface" "between" "the" "View" "and" "the" "input" "devices" "such" "as" "mouse" "and" "keyboard" "" "^/"]
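
The splitting step is easier to follow on a short string. Here is a minimal sketch, assuming REBOL 2 (where parse accepts a string of delimiter characters); the names sample and tokens are made up for illustration and are not part of the code above:

```rebol
REBOL [Title: "parse/all splitting sketch"]

; Illustrative input only; any string would do.
sample: "lost art of user-interface programming"

; Every character of the second string is treated as a delimiter,
; so this splits on both spaces and hyphens.
tokens: parse/all sample " -"

print mold tokens
; prints ["lost" "art" "of" "user" "interface" "programming"]
```

When two delimiters follow each other, as in "http://www", the raw list above shows that an empty string is produced between them, which is why "" is part of the exclude list.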

And then remove the common words with the exclude function like this:

exclude raw-list exclude-list

which gives this final result:

["davidtemkin" "mtarchive" "000002" "lost" "art" "user" "interface" "programming" "Way" "back" "when" "late" "80s" "early" "90s" "industry" "focused" "making" "computing" "useful" "accessible" "mainstream" "required" "attention" "human" "computer" "came" "Web" "people" "cis" "ksu" "edu" "schmidt" "text" "densem" "Denotational" "Semantics" "Methodology" "Language" "Development" "1986" "Allyn" "Bacon" "published" "wrote" "while" "post" "doc" "Edinburgh" "1982" "83" "book" "sold" "steadily" "purchased" "William" "Brown" "McGraw" "Hill" "deleted" "as" "acquired" "heim" "ifi" "uio" "trygver" "themes" "mvc" "index" "XEROX" "PARC" "1978" "79" "made" "first" "implementation" "original" "note" "defines" "four" "terms" "Model" "View" "Controller" "Editor" "ephemeral" "component" "creates" "demand" "between" "input" "devices" "mouse" "keyboard"]
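
Why exclude and not difference? A minimal sketch with made-up word lists may help (words, stop-words, kept and mixed are illustrative names, not from the code above). As far as I know, exclude keeps only the items of the first block that are missing from the second, while difference is symmetric and also returns items found only in the second block:

```rebol
REBOL [Title: "exclude versus difference sketch"]

; Illustrative data only.
words: ["lost" "art" "of" "user" "interface" "of"]
stop-words: ["of" "the"]

; exclude: items of words that are not in stop-words
kept: exclude words stop-words
print mold kept
; prints ["lost" "art" "user" "interface"]

; difference: items that are in only one of the two blocks,
; so the unused stop word "the" leaks into the result
mixed: difference words stop-words
print mold mixed
; prints ["lost" "art" "user" "interface" "the"]
```

Both functions return each value once, which is why repeated words show up only once in the final result above.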

This is a quick and dirty way, but it works fine (at least for just two lines of code; compare that with doing the same thing in "easy" PHP :) ). For a cleaner and more feature-rich option, see Make-Word-List by Peter Wood.
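
If you want to reuse the two instructions, they can be wrapped in a small function. This is only a sketch, assuming REBOL 2; extract-keywords is a hypothetical name, and its delimiter string is a shortened version of the one used above (no quote or brace characters):

```rebol
REBOL [Title: "Keyword extraction sketch"]

; Hypothetical helper: split the text on delimiter characters,
; then drop every word found in the stop-word block.
extract-keywords: func [
    text [string!] "Text to split into words"
    stop-words [block!] "Words to leave out (include the empty string)"
][
    exclude parse/all text { :,'/.-;~} stop-words
]

result: extract-keywords "the lost art of user-interface programming" ["" "the" "of"]
print mold result
; prints ["lost" "art" "user" "interface" "programming"]
```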

Update: I made a typo; I previously wrote difference instead of exclude.
