In one of the first tutorials you learned how to extract text from an HTML web page in a few lines, without using regular expressions. Since I'd like to automate software downloads, I'll now show you how to extract all the links from an HTML page, optionally filtered on the zip/exe or tar.gz extension. Of course, we'll again use the human-friendly parse function.
For testing purposes, I have added a /clipboard refinement so that you can copy all the links found to the clipboard, like this (in this example I copy all zip links to the clipboard):
extract-links/zip/clipboard http://www.rebol.com/view-platforms.html
The whole source code is at the end; read on if you want some explanation.
If you remember, you can easily load the whole web page into a block (array) with a single function, load/markup:
html-content: read http://www.rebol.com/view-platforms.html
html-block: load/markup html-content
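To see what load/markup produces without going to the network, here is a minimal sketch run on an inline snippet. The HTML string and the words sample and block are made up for illustration; only load/markup, tag? and mold come from Rebol itself.

```rebol
; Hypothetical inline HTML, used instead of a real page
sample: {<p>Get <a href="files/app.zip">the zip</a></p>}

block: load/markup sample

; Each item of the block is either a tag! or a string!
foreach item block [
    print [either tag? item ["tag!   "] ["string!"] mold item]
]
```

The point is that tags and text arrive as separate, typed values, so a parse rule can match on the datatype instead of scanning characters.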
You can then parse html-block with:
parse html-block rule
where rule is:
rule: [
    some [set tag tag! (prin tag) | string!]
]
which prints all tags. Contrast this with the rule we used to extract text:
rule: [
    some [tag! | set x string! (prin x)]
]
The two rules are nearly identical: one acts on tag! values and skips strings, the other does the opposite.
To test it, copy and paste this into the console:
html-content: read http://www.rebol.com/view-platforms.html
html-block: load/markup html-content
parse html-block [
    some [set tag tag! (prin tag) | string!]
]
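To keep only the tags that contain an href attribute, you can use find with the /tail refinement, which returns the position just after the match (or none if there is no match):
set tag tag! (
    if url: find/tail tag "href=" [
        print url
    ]
)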
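Here is a small sketch of what find/tail gives back, using two made-up tags (the tag values and the words link and anchor are illustrative only). Because the result is either a position or none, it can be used directly as the condition of if.

```rebol
link:   <a href="downloads/view.exe">
anchor: <a name="top">

; Position just after "href=", still a tag! value
probe find/tail link "href="

; No href attribute: the result is none, so an if body would be skipped
probe find/tail anchor "href="

; to-string drops the angle brackets but keeps the quotes
probe to-string find/tail link "href="
```

Note that find is case-insensitive on strings by default, so a tag written with HREF= is matched as well.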
Test it:
html-content: read http://www.rebol.com/view-platforms.html
html-block: load/markup html-content
parse html-block [
    some [
        set tag tag! (
            if url: find/tail tag "href=" [
                print url
            ]
        )
        | string!
    ]
]
You will get:
<"/main.css" type="text/css" charset="utf-8"> <"http://www.rebol.com"> <"/index-lang.html"> <"/download.html"> <"/docs.html"> <"/community.html"> <"http://www.rebol.net"> <"http://www.rebol.org"> <"/license.html"> <"/mission.html"> <"/feedback.html"> <"/contacts.html"> <"platforms.html"> <"docs/unpack-tar-gz.html"> <"logos.html"> <"downloads/v276/rebview.exe"> <"downloads/v276/rebview-osxi.tar.gz"--> <"downloads/v276/rebview-osxppc.tar.gz"> <"downloads/v276/rebview-linx86.tar.gz"> <"downloads/v276/rebview-fedx86.tar.gz"> <"downloads/v276/rebview-ppc.gzip"> <"downloads/v276/rebview-obsd38.gzip"> <"downloads/view.exe"> <"http://www.rebol.net/builds/"> <"downloads/rebol-view-1302094.tar.gz"> <"http://www.rebol.net/builds/"> <"http://www.rebol.net/builds/"> <"logos.html"> <"license"> <"downloads/view-pro031.zip"> <"downloads/view-pro042.tar.gz"> <"downloads/view-pro101.tar.gz"> <"platforms-view.html"> </cgi-bin/wip.r?r=edit+%25web/view-platforms.rmd&>
You can then apply the same find function to the url to keep only zip/exe or tar.gz files, depending on which refinement (zip or tar.gz) was passed to the function. With no refinement, every link is kept. This code takes the place of print url and collects the matches in a block named urls:
either not any [zip tar.gz] [
    append urls url
][
    if zip [
        if any [find url ".zip" find url ".exe"] [
            append urls url
        ]
    ]
    if tar.gz [
        if any [find url ".tar.gz" find url ".tgz"] [
            append urls url
        ]
    ]
]
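One thing to keep in mind is that find matches the extension anywhere in the link, not only at the end. If that matters for your pages, a stricter test could look like the sketch below. The helper ends-with? is a hypothetical name of mine, not a Rebol built-in, and the sample links are invented.

```rebol
; Hypothetical helper: true when url ends with ext
ends-with?: func [url [string!] ext [string!]] [
    ; compare ext with the last (length? ext) characters of url
    equal? ext copy skip tail url negate length? ext
]

; A link that really ends with .zip
print ends-with? "downloads/view-pro031.zip" ".zip"

; A link that merely contains .zip somewhere in the middle
print ends-with? "archive.zip.html" ".zip"

; The plain find test accepts the second link too
print found? find "archive.zip.html" ".zip"
```

For the download pages used in this tutorial the plain find test is good enough, so I keep it in the rest of the code.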
To test it, copy and paste the code below into the console, EXCEPT the header (which is only needed when executing from a file):
Rebol [
    Usage: {
        Example 1:
        extract-links/zip/clipboard http://www.rebol.com/view-platforms.html
    }
]
extract-links: func [url /zip-exe /zip /exe /tar /tar.gz /clipboard /local urls html-content html-block x] [
    html-content: read to-url url
    html-block: load/markup html-content
    urls: copy []
    ; for each tag!, store the value in x
    ; if x contains "href=", keep what follows as a candidate url
    ; otherwise it is just a string!: nothing to do
    parse html-block [
        some [
            set x tag! (
                if x: find/tail x "href=" [
                    url: to-string x
                    either not any [zip-exe zip exe tar tar.gz] [
                        ; no filter requested: keep every link
                        append urls url
                    ][
                        if zip-exe [
                            if any [find url ".zip" find url ".exe"] [
                                append urls url
                            ]
                        ]
                        if zip [
                            if find url ".zip" [
                                append urls url
                            ]
                        ]
                        if exe [
                            if find url ".exe" [
                                append urls url
                            ]
                        ]
                        if any [tar.gz tar] [
                            if any [find url ".tar.gz" find url ".tgz"] [
                                append urls url
                            ]
                        ]
                    ]
                ]
            ) | string!
        ]
    ]
    if clipboard [
        write clipboard:// mold urls
    ]
    urls
]
Now try extract-links on the Rebol View download page:
extract-links/tar.gz/clipboard http://www.rebol.com/view-platforms.html
which should output:
[{"downloads/v276/rebview-osxi.tar.gz"--}
{"downloads/v276/rebview-osxppc.tar.gz"}
{"downloads/v276/rebview-linx86.tar.gz"}
{"downloads/v276/rebview-fedx86.tar.gz"}
{"downloads/rebol-view-1302094.tar.gz"}
{"downloads/view-pro042.tar.gz"}
{"downloads/view-pro101.tar.gz"}]
As you can see, there are some glitches:
- the link is surrounded by quotes
- there can be unwanted characters before or after it
To get rid of them, you can parse the url again:
parse url [thru {"} copy url to {"} to end]
Put the above instruction just below:
url: to-string x
If there are no quotes, the parse rule simply fails without touching url, so the url stays the same.
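To check this behaviour in the console, here is a minimal sketch that wraps the same rule in a small function. The name clean-link is hypothetical and only serves the test; the first sample string is taken from the output above.

```rebol
; Hypothetical helper wrapping the cleanup rule from the text
clean-link: func [url [string!]] [
    ; on success, url is replaced by the text between the first two quotes
    parse url [thru {"} copy url to {"} to end]
    url
]

; Quoted link followed by stray characters
probe clean-link {"downloads/v276/rebview-osxi.tar.gz"--}

; No quotes at all: the rule fails and the string comes back unchanged
probe clean-link {downloads/view.exe}
```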
The full source code is now:
Rebol [
    Usage: {
        Example 1:
        extract-links/zip/clipboard http://www.rebol.com/view-platforms.html
    }
]
extract-links: func [url /zip-exe /zip /exe /tar.gz /clipboard /local urls html-content html-block x] [
    html-content: read to-url url
    html-block: load/markup html-content
    urls: copy []
    ; for each tag!, store the value in x
    ; if x contains "href=", keep what follows as a candidate url
    ; otherwise it is just a string!: nothing to do
    parse html-block [
        some [
            set x tag! (
                if x: find/tail x "href=" [
                    url: to-string x
                    ; keep only the text between the first pair of quotes, if any
                    parse url [thru {"} copy url to {"} to end]
                    either not any [zip-exe zip exe tar.gz] [
                        ; no filter requested: keep every link
                        append urls url
                    ][
                        if zip-exe [
                            if any [find url ".zip" find url ".exe"] [
                                append urls url
                            ]
                        ]
                        if zip [
                            if find url ".zip" [
                                append urls url
                            ]
                        ]
                        if exe [
                            if find url ".exe" [
                                append urls url
                            ]
                        ]
                        if tar.gz [
                            if any [find url ".tar.gz" find url ".tgz"] [
                                append urls url
                            ]
                        ]
                    ]
                ]
            ) | string!
        ]
    ]
    if clipboard [
        write clipboard:// mold urls
    ]
    urls
]
Another example: list all zip and tgz links from the PHP symfony framework download page:
extract-links/zip/tar.gz http://www.symfony-project.org/installation/1_4
Which should output all zip and tgz files:
["http://www.symfony-project.org/get/symfony-1.4.1.tgz" "http://www.symfony-project.org/get/symfony-1.4.1.zip" "http://www.symfony-project.org/get/sf_sandbox_1_4.tgz" "http://www.symfony-project.org/get/sf_sandbox_1_4.zip"]
This even works on SourceForge (the result runs to about 400 lines, hence the copy to the clipboard):
extract-links/zip/clipboard http://sourceforge.net/projects/xampp/files/
You can then do whatever you want with the urls block, most obviously download the files.
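As a rough sketch of that download step, the two functions below save every link of a block into the current directory. This is only one possible approach, and it assumes each entry is already an absolute URL (as in the symfony example); relative links such as downloads/view.exe would first have to be joined to the address of the page. The names local-name and download-all are mine, not part of the script above.

```rebol
; Hypothetical helper: use the last path segment as the local file name
local-name: func [link [string! url!]] [
    second split-path to-url link
]

; Hypothetical helper: fetch each link and write it to disk
download-all: func [links [block!] /local target] [
    foreach link links [
        target: local-name link
        print ["Downloading" link "to" target]
        write/binary target read/binary to-url link
    ]
]

; Example call (needs network access and the extract-links function):
; download-all extract-links/zip/tar.gz http://www.symfony-project.org/installation/1_4
```

Reading and writing with the /binary refinement matters here, since archives and executables must not go through any text conversion.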

















