Sometimes you need to find out whether the URLs on a page actually exist.
The following code reads the HTML of the page, extracts all the URLs with a
regular expression, and then checks whether each URL exists.
Take a look at the following code:
// Requires: using System.Collections; using System.IO;
// using System.Net; using System.Text.RegularExpressions;
protected void Button1_Click(object sender, EventArgs e)
{
    // Download the HTML of the page whose links we want to check
    WebRequest req = WebRequest.Create("http://localhost:1852/LookIntoDoPostBack/UrlList.aspx");
    string html;
    using (HttpWebResponse res = (HttpWebResponse) req.GetResponse())
    using (StreamReader reader = new StreamReader(res.GetResponseStream()))
    {
        html = reader.ReadToEnd();
    }
    ArrayList badUrls = new ArrayList();
    // Get the links
    string pattern = @"((http|ftp|https):\/\/w{3}[\d]*.|(http|ftp|https):\/\/|w{3}[\d]*.)" +
                     @"([\w\d\._\-#\(\)\[\]\\,;:]+@[\w\d\._\-#\(\)\[\]\\,;:])?" +
                     @"([a-z0-9]+.)*[a-z\-0-9]+.([a-z]{2,3})?[a-z]{2,6}(:[0-9]+)?" +
                     @"(\/[\/a-z0-9\._\-,]+)*[a-z0-9\-_\.\s\%]+(\?[a-z0-9=%&\.\-,#]+)?";
    Regex r = new Regex(pattern);
    MatchCollection mC = r.Matches(html);
    // Iterate through the collection and find out if each url exists
    foreach (Match m in mC)
    {
        if (!DoesUrlExists(m.Value))
        {
            // Add to the list of broken urls
            badUrls.Add(m.Value);
        }
    }
    // Display the bad urls in the GridView control
    gvBadUrls.DataSource = badUrls;
    gvBadUrls.DataBind();
}
private bool DoesUrlExists(string url)
{
    bool urlExists = false;
    WebRequest req = WebRequest.Create(url);
    try
    {
        // If GetResponse returns without throwing, the url is reachable.
        // Dispose the response so the connection is released.
        using (HttpWebResponse response = (HttpWebResponse) req.GetResponse())
        {
            urlExists = true;
        }
    }
    catch (System.Net.WebException)
    {
        // The request failed (404, DNS error, timeout, ...), so treat the url as broken
    }
    return urlExists;
}
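As a side note, DoesUrlExists downloads the whole response body for every link. A cheaper variant, sketched below (the method name DoesUrlExistQuick and the five-second timeout are my choices, not part of the original code), asks the server for the headers only via a HEAD request and fails fast with a timeout:
private bool DoesUrlExistQuick(string url)
{
    try
    {
        HttpWebRequest req = (HttpWebRequest) WebRequest.Create(url);
        req.Method = "HEAD";   // ask for the response headers only, not the body
        req.Timeout = 5000;    // give up after five seconds instead of waiting
        using (HttpWebResponse response = (HttpWebResponse) req.GetResponse())
        {
            return true;
        }
    }
    catch (WebException)
    {
        return false;
    }
    catch (InvalidCastException)
    {
        // Not an http/https url (the regex also matches ftp links)
        return false;
    }
}
Keep in mind that a few servers reject HEAD requests, so you may still need to fall back to a normal GET for the occasional url.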
When I find a bad url I simply put it in an ArrayList, and later I display the
bad urls in the GridView control. Note that the code will not report a bad url
if your ISP redirects you to a custom error page instead of returning an HTTP
404 (Page Not Found) response, because in that case the request still succeeds.
Also, checking the urls one by one is very time consuming, so if you use this I
suggest running the checks on a different thread, as in the sketch below.
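As a rough sketch of that idea (my code, not from the original post; it also needs using System.Threading;), each check can be queued on the thread pool so the requests run in parallel, and the page thread waits until the last one signals completion:
private ArrayList FindBadUrlsInParallel(MatchCollection matches)
{
    // Synchronized wrapper so thread-pool threads can add to the list safely
    ArrayList badUrls = ArrayList.Synchronized(new ArrayList());
    int pending = matches.Count;
    ManualResetEvent allDone = new ManualResetEvent(pending == 0);
    foreach (Match m in matches)
    {
        string url = m.Value;   // capture a fresh local for each iteration
        ThreadPool.QueueUserWorkItem(delegate
        {
            if (!DoesUrlExists(url))
            {
                badUrls.Add(url);
            }
            // The last check to finish wakes up the waiting page thread
            if (Interlocked.Decrement(ref pending) == 0)
            {
                allDone.Set();
            }
        });
    }
    allDone.WaitOne();
    return badUrls;
}
The button click still blocks until the slowest request completes, but the requests now overlap instead of running one after another, and the returned list can be bound to gvBadUrls exactly as before.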