[Original] Two Ways to Extract Links from a Web Page's Source - Delphi Forum

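For the past few days I have been planning a website content extractor, and the first
step is getting the links out of a web page. I do not want to use the "TWebBrowser"
control, because it is painfully slow to initialize, so I will fetch the page source
with the TIdHTTP control instead. Note that TIdHTTP needs an SSL handler attached
before it can open "https" (SSL-secured) URLs. For how to attach SSL to TIdHTTP, see
this article on my site:
http://sz319.net.ru/article/artinfo.aspx?TypeID=21&PageNum=1&Keystr=&looktype=&artid=124
Most links are plain "http" anyway, and reading a few articles hardly calls for SSL.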
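
For readers who do need "https", here is a minimal sketch of attaching an SSL handler to TIdHTTP. It assumes Indy 10, where the handler class is TIdSSLIOHandlerSocketOpenSSL in the IdSSLOpenSSL unit, and it assumes the OpenSSL DLLs are available at run time. The function name FetchHttps and the example.com URL are placeholders of mine, not taken from the linked article.

```pascal
program HttpsGet;
{$APPTYPE CONSOLE}

uses
  SysUtils, IdHTTP, IdSSLOpenSSL;

// Hypothetical helper: fetches Url through an OpenSSL-backed IOHandler.
function FetchHttps(const Url: string): string;
var
  Http: TIdHTTP;
  Ssl: TIdSSLIOHandlerSocketOpenSSL;
begin
  Http := TIdHTTP.Create(nil);
  try
    // Owned by Http, so it is freed together with it.
    Ssl := TIdSSLIOHandlerSocketOpenSSL.Create(Http);
    Http.IOHandler := Ssl;
    Result := Http.Get(Url);
  finally
    Http.Free;
  end;
end;

begin
  // Placeholder URL; replace with the page you want to read.
  Writeln(Length(FetchHttps('https://example.com/')), ' characters received');
end.
```

Older Indy versions name the handler class differently, so check the units shipped with your Delphi before copying this.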

Below are two ways to get the links in a page.
First, fetch the page source:

var
  Data: TMemoryStream;   // created and freed elsewhere (omitted here)
  httpclient: TIdHTTP;

function TForm1.GetHtml(strUrl: string): string;
var
  raw: AnsiString;
begin
  Result := '';
  Data.Clear;
  httpclient.Get(strUrl, Data);
  // Copy exactly Data.Size bytes; the stream contents are not #0-terminated,
  // so they must not be read as a PChar.
  SetLength(raw, Data.Size);
  if Data.Size > 0 then
  begin
    Data.Position := 0;
    Data.ReadBuffer(raw[1], Data.Size);
  end;
  Result := string(raw);
end;

Method 1: Using IHTMLDocument2
uses MSHTML, ActiveX, ComObj, Variants;

procedure TForm1.SetDoc(strhtml: string);
var
  IDoc: IHTMLDocument2;
  all: IHTMLElementCollection;
  len, i: Integer;
  item: OleVariant;
  v: Variant;
begin
  IDoc := CreateComObject(CLASS_HTMLDocument) as IHTMLDocument2;
  try
    IDoc.designMode := 'on';
    while IDoc.readyState <> 'complete' do
      Application.ProcessMessages;

    // IHTMLDocument2.write expects a SAFEARRAY of variants
    v := VarArrayCreate([0, 0], varVariant);
    v[0] := strhtml;
    IDoc.write(PSafeArray(System.TVarData(v).VArray));
    IDoc.designMode := 'off';

    while IDoc.readyState <> 'complete' do
      Application.ProcessMessages;

    all := IDoc.links;   // every <a href> and <area href> element
    len := all.length;
    for i := 0 to len - 1 do
    begin
      item := all.item(i, varEmpty);   // EmptyParam also works
      Memo1.Lines.Add(item.href);
    end;
  finally
    IDoc := nil;
  end;
end;

Method 2: Using regular expressions
You must import the type library first:
Project -> Import Type Library: Microsoft VBScript Regular Expressions 5.5 (Version 5.5)

uses VBScript_RegExp_55_TLB;

procedure TForm1.SetRegDoc(strhtml: string);
var
  objExp: TRegExp;
  machs: IMatchCollection;
  Matchs: Match;
  i: Integer;
begin
  // Create without an owner and free explicitly, so repeated calls do not
  // pile up TRegExp instances on the form.
  objExp := TRegExp.Create(nil);
  try
    objExp.Pattern := '((http|https):(\/\/|\\\\)((\w)+[.]){1,}(net|com|cn|org|cc|tv|([0-9]{1,3}))(((\/[\~]*|\\[\~]*)(\w)+)|[.](\w)+)*(((([?](\w)+){1}[=]*))*((\w)+){1}([\&](\w)+[\=](\w)+)*)*)';
    objExp.IgnoreCase := True;
    objExp.Global := True;   // find every match, not just the first
    machs := objExp.Execute(strhtml) as IMatchCollection;
    for i := 0 to machs.Count - 1 do
    begin
      Matchs := machs.Item[i] as Match;
      Memo1.Lines.Add(Matchs.Value);
    end;
  finally
    objExp.Free;
  end;
end;

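Both methods have their pros and cons. I will not go through them here, so try them yourself.
There are of course other approaches, such as searching the string for the values of "href=", which you can implement on your own.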
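
The "href=" search mentioned above could be sketched roughly as follows. This is only one possible implementation under simple assumptions, not the author's code: the procedure name ExtractHrefs is hypothetical, it accepts double-quoted, single-quoted and unquoted values, and it treats any occurrence of "href" followed by "=" as an attribute, so it does not skip comments or scripts and does not resolve relative URLs.

```pascal
program HrefScan;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, StrUtils;

// Adds the value of every href= attribute found in Html to Links.
procedure ExtractHrefs(const Html: string; Links: TStrings);
var
  Lower: string;
  P, I, J, N, Next: Integer;
  Quote: Char;
begin
  Lower := LowerCase(Html);   // same length as Html, so indexes line up
  N := Length(Html);
  P := PosEx('href', Lower, 1);
  while P > 0 do
  begin
    Next := P + 4;
    I := Next;
    // skip whitespace between "href" and "="
    while (I <= N) and (Html[I] <= ' ') do Inc(I);
    if (I <= N) and (Html[I] = '=') then
    begin
      Inc(I);
      // skip whitespace between "=" and the value
      while (I <= N) and (Html[I] <= ' ') do Inc(I);
      if I <= N then
      begin
        if (Html[I] = '"') or (Html[I] = '''') then
        begin
          Quote := Html[I];
          Inc(I);
          J := PosEx(Quote, Html, I);
          if J = 0 then J := N + 1;   // unterminated quote: take the rest
        end
        else
        begin
          // unquoted value ends at whitespace or '>'
          J := I;
          while (J <= N) and (Html[J] > ' ') and (Html[J] <> '>') do Inc(J);
        end;
        Links.Add(Copy(Html, I, J - I));
        Next := J;   // continue after the value, not inside it
      end;
    end;
    P := PosEx('href', Lower, Next);
  end;
end;

var
  Links: TStringList;
  K: Integer;
begin
  Links := TStringList.Create;
  try
    ExtractHrefs('<a href="http://a.com/x">A</a> <A HREF=''/b''>B</A> ' +
                 '<a href=c.html>C</a>', Links);
    for K := 0 to Links.Count - 1 do
      Writeln(Links[K]);
  finally
    Links.Free;
  end;
end.
```

In the form from this post, the same procedure could be called as ExtractHrefs(GetHtml(strUrl), Memo1.Lines).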

Feedback is welcome. Email: sz319@163.com Home page: http://sz319.net.ru
Please keep this notice when reposting.
