extract_html

Do you need to do some web scraping but hate regular expressions? Rebol handles this well with a kind of natural language (a dialect, in Rebol's own terms) thanks to two functions: load/markup and parse.

One classic case is stripping all HTML tags so that only the text remains.

The sample code already exists at http://www.rebolforces.com/articles/rebolandtheshell.html#sect2.4.
I reproduce it below so that I can comment on it:


rebol [
  title: "strip all html tags"
  author: "http://www.rebolforces.com/articles/rebolandtheshell.html#sect2.4"
  version: 1.0.0
]

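; Ask for a URL, read the page, and split it into tags and strings
parse load/markup read to-url ask "url: " [
    ; skip every tag! value; print every string! value without a line break
    some [tag! | set x string! (prin x)]
]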

Rebol’s load/markup function does most of the work for you: it converts HTML or XML into a block (Rebol’s kind of array) of tags and strings. You can then parse the resulting block to discard every tag (if the value is of type tag!, do nothing) and output only the text with the prin function, which “outputs a value with no line break”.
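
To see the two steps separately, here is a minimal sketch that works on a literal string instead of a live URL. It assumes REBOL 2, where load accepts the /markup refinement; the sample HTML and the words html, blk and text are made up for illustration and do not come from the original article.

```rebol
rebol [title: "load/markup sketch"]

; Hypothetical sample input, used instead of reading a live URL
html: {<p>Hello <b>world</b>!</p>}

; Step 1: load/markup splits the string into tag! and string! values
blk: load/markup html
probe blk

; Step 2: skip the tags and collect the strings in one place
; (append to a string here, rather than prin, so the result can be reused)
text: copy ""
parse blk [
    any [tag! | set x string! (append text x)]
]
print text
```

The probe line should show the tags and the text pieces interleaved in a single block, and the final print should show only the text. Using any instead of some simply lets the rule also succeed on an empty block.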

Update: For another implementation, see Carl Sassenrath’s Recipe.
