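How can I write a program that converts an HTML news article into an XML file?
How can an HTML file found on the web be converted into an XML file? The algorithm is given below.
Input: HTML file from the ChinaTimes website.
Output: XML file.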
Method:
Step 1: Get HTML file from URL: http://www.
Step 1.1: Compare the file name of the HTML file with that of the last file processed in the previous iteration; the last file name is saved in 'check.txt'.
Step 1.2: If they are different, go to Step 2. Otherwise sleep for several minutes and then go back to Step 1.
Step 2: If (data='<li>'), record the URL link that follows '<li>'.
Step 3: Retrieve web page by using the URL link.
Step 4: Compare the retrieved data with the data retrieved last time from the same URL. If they are the same, ignore the currently retrieved data and go to Step 1; otherwise continue.
Step 5: If (data='<tr>'), execute the following substeps:
Step 5.1: If the data match the Title format (like '<Title>...</Title>'), save the title data in the XML file.
Step 5.2: If the data match the Date format (like '<Date>...</Date>'), save the date data in the XML file.
Step 5.3: If the data match the Reporter format (like '<Reporter>...</Reporter>'), save the reporter data in the XML file.
Step 5.4: If the data match the Location format (like '<Location>...</Location>'), save the location data in the XML file.
Step 5.5: If the data match the e-news format (like '<News>...</News>'), save the e-news data in the XML file.
Step 6: Save the file name of current processed HTML file in 'check.txt'.
Step 7: Go to Step 1.
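
The steps above could be sketched in Python roughly as follows. This is only a minimal sketch under several assumptions, not a finished crawler. The helper names (LinkCollector, extract_links, extract_fields, to_xml, already_processed, mark_processed) and the 'Article' root element are made up for illustration. It also assumes the page really contains literal markers such as '<Title>...</Title>', as Step 5 suggests, and it does not check for the enclosing '<tr>'. Downloading the pages (Steps 1 and 3) is left out so the example runs offline; something like urllib.request.urlopen could fill that role.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from pathlib import Path

# Field names taken from Steps 5.1-5.5; the real page markup may differ.
FIELDS = ("Title", "Date", "Reporter", "Location", "News")


class LinkCollector(HTMLParser):
    """Step 2: record the href of the first <a> that follows each <li>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._after_li = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "li":
            self._after_li = True
        elif tag == "a" and self._after_li:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
            self._after_li = False


def extract_links(index_html):
    """Return the article links found in the index page."""
    parser = LinkCollector()
    parser.feed(index_html)
    return parser.links


def extract_fields(article_html):
    """Step 5: pull each <Field>...</Field> span out of the page text."""
    found = {}
    for name in FIELDS:
        pattern = rf"<{name}>(.*?)</{name}>"
        match = re.search(pattern, article_html, re.IGNORECASE | re.DOTALL)
        if match:
            found[name] = match.group(1).strip()
    return found


def to_xml(fields):
    """Build the XML text; ElementTree escapes special characters."""
    root = ET.Element("Article")  # hypothetical root element name
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


def already_processed(file_name, check_file="check.txt"):
    """Step 1.1: compare with the name saved by the previous iteration."""
    path = Path(check_file)
    return path.exists() and path.read_text(encoding="utf-8").strip() == file_name


def mark_processed(file_name, check_file="check.txt"):
    """Step 6: remember the file that was just processed."""
    Path(check_file).write_text(file_name, encoding="utf-8")
```

A driver loop would call already_processed first, sleep for a few minutes when it returns True, and otherwise run extract_links, fetch each article, pass it through extract_fields and to_xml, then call mark_processed. For real pages, a regular expression is fragile; walking the document with html.parser for the article body as well is probably the safer choice.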