Help with link-text extractor


Webmaster-Toolkit
17-08-2002, 05:44/05:44AM
Hi

I need a bit of help with one part of a script I'm writing. The code I'm using to extract the link text from a specified web page seems to work on most pages, but unfortunately it doesn't pick up everything when you try it out on our illustrious administrator's (;)) page - http://www.freemoneyservices.com/

The code I'm using is this (simplified a bit):

use LWP::Simple;

$url = $ARGV[0];   # the url inputted (assumed here to come from the command line)
$html = get($url);

# remove newlines and multiple spaces
$html =~ s/\n//g;
$html =~ s/\r//g;
$html =~ s/\s+/ /g;

# get link text from page
@links = ($html =~ m/<a[^>]*>([^<\/a]*)<\/a>/ig);


Could anyone be so kind as to point out my error, or suggest any better way?

Thanks in advance!

bigDugan
17-08-2002, 13:46/01:46PM
It might make it easier if you re-post your link here.

plattypus
18-08-2002, 05:40/05:40AM
I must be bonkers - why am I writing this for you?!

Okay... this is a very "feature empty" version - it doesn't handle linked images, alt text etc. Basically, it doesn't pay to simplify! And you need to dig out the old regular expressions manual.

Here's some code:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $url  = "http://www.freemoneyservices.com/";
my $html = get($url);
die "Could not fetch $url\n" unless defined $html;

# replace newlines with spaces, then collapse runs of whitespace
$html =~ s/[\r\n]/ /g;
$html =~ s/\s+/ /g;


# get link text from page: whatever sits between an <a ...> tag and the next "<"
# (the \b keeps the pattern from matching <abbr>, <area> and the like)
my @links = ($html =~ m/<a\b[^>]*>([^<]*)/ig);

# Post-process these links. Trim leading/trailing spaces, drop blank entries etc.
my @linksfinal;
foreach my $anchor (@links) {
    $anchor =~ s/^\s+//;
    $anchor =~ s/\s+$//;
    if ($anchor ne "") { push @linksfinal, $anchor; }
}


# This just lists the anchors
foreach my $anchor (@linksfinal) {
    print $anchor, "\n";
}
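
As for why the pattern in the first post misses links: [^<\/a] is a character class, not "anything up to </a>", so the captured text may not contain "<", "/" or the letter "a" anywhere. Any link text with an "a" or a slash in it is skipped. Here's a minimal sketch of the difference; the sample HTML is made up for illustration and is not taken from the page in question.

```perl
use strict;
use warnings;

# Hypothetical sample HTML, invented for this illustration.
my $html = '<a href="/x">Home</a> <a href="/y">Contact us</a> '
         . '<a href="/z">FAQ/Help</a>';

# Pattern from the first post: the class [^<\/a] rejects "<", "/" and "a",
# so only link text made up of other characters can match.
my @original = ($html =~ m/<a[^>]*>([^<\/a]*)<\/a>/ig);

# Alternative: capture anything that is not "<", then require the closing tag.
my @fixed = ($html =~ m/<a\b[^>]*>([^<]*)<\/a>/ig);

print "original: ", join(" | ", @original), "\n";
print "fixed:    ", join(" | ", @fixed), "\n";
```

With this sample the first pattern only picks up "Home", while the second picks up all three. Neither copes with markup nested inside the link (for example a <b> tag around the text), which is where a regex approach starts to run out of road.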

Webmaster-Toolkit
18-08-2002, 09:15/09:15AM
Wow. Thanks plattypus :cheers:

highman
18-08-2002, 17:34/05:34PM
I must be bonkers - why am I writing this for you?!


:thumb: great stuff!

motiv8x
09-03-2003, 20:08/08:08PM
You may want to look into HTML::Parser.
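
For reference, a rough sketch of what that could look like with the HTML::Parser module from CPAN (version 3 API). The sample HTML and the variable names are made up for illustration; for a real page you would feed it the string returned by get($url) instead.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample HTML, invented for this illustration.
my $html = '<p><a href="/a">Plain link</a> and '
         . '<a href="/b"><b>bold</b> link</a></p>';

my @links;            # finished link texts
my $in_anchor = 0;    # true while we are between <a ...> and </a>
my $current   = '';   # text collected for the link being read

my $parser = HTML::Parser->new(
    api_version => 3,
    # Opening tag: start collecting when we hit an <a>.
    start_h => [ sub {
        my ($tag) = @_;
        if ($tag eq 'a') { $in_anchor = 1; $current = ''; }
    }, 'tagname' ],
    # Text: append it (entities decoded) while inside a link.
    text_h => [ sub {
        my ($text) = @_;
        $current .= $text if $in_anchor;
    }, 'dtext' ],
    # Closing tag: on </a>, tidy up the whitespace and store the text.
    end_h => [ sub {
        my ($tag) = @_;
        return unless $tag eq 'a' && $in_anchor;
        $current =~ s/\s+/ /g;
        $current =~ s/^ | $//g;
        push @links, $current if length $current;
        $in_anchor = 0;
    }, 'tagname' ],
);

$parser->parse($html);
$parser->eof;

print "$_\n" for @links;
```

Because the parser reports text separately from tags, markup nested inside a link (the <b> in the second sample link) no longer breaks the extraction the way it does with a single regex.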