
How to get lots of XML files (at least 10,000) in a short period of time?


Aeroman

Programmer
Feb 14, 2001
I'm doing research on processing XML queries against XML documents. First I need to collect a large set of XML documents (at least 10,000) to use as samples, and then write a parser to search for keywords across all of them. Can anyone tell me how to get that many XML documents in a short period of time, instead of downloading them one at a time? I really appreciate anyone's answer and help!

Mike Fang
 
Do you just need XML documents, which you can probably get from a web crawler, or do you actually need to create them for a special purpose?

If the second, you can probably write a crude "random document creator" and just loop it 10,000 times...

Something like (C code):

#include <stdio.h>
#include <stdlib.h>

/* Print one random attribute, e.g. fur="red" */
void randomAttrib(void) {
    const char *wordArray[]  = { "fur", "eyes", "collar", "ears", "tail" };
    const char *wordArray2[] = { "red", "blue", "green", "yellow", "black" };
    int index  = rand() % 5; /* 0-4 */
    int index2 = rand() % 5; /* 0-4 */
    printf(" %s=\"%s\"", wordArray[index], wordArray2[index2]);
}

/* Print one random element, recursing to nest child elements.
   The level < 5 cap is a safety net against runaway recursion. */
void randomElement(int level) {
    const char *wordArray[] = { "cat", "dog", "elephant", "fish", "shark" };
    int index = rand() % 5; /* 0-4 */
    printf("<%s", wordArray[index]);
    while (rand() % 2) {              /* 50% chance of another attribute */
        randomAttrib();
    }
    printf(">\n");
    while (level < 5 && rand() % 2) { /* 50% chance of a child element */
        randomElement(level + 1);
    }
    printf("</%s>\n", wordArray[index]);
}
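
To actually produce the files, you could wrap it in a small driver like this (a minimal sketch; the doc%05d.xml naming scheme is just an illustration):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical driver: writes 10,000 random documents, one per file.
   Assumes randomElement() from the snippet above is in scope. */
int main(void) {
    srand((unsigned) time(NULL));    /* seed once so each run differs */
    for (int i = 0; i < 10000; i++) {
        char name[32];
        sprintf(name, "doc%05d.xml", i);
        freopen(name, "w", stdout);  /* point stdout at the next file */
        randomElement(0);
    }
    return 0;
}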

You can use any number of words (just change the modulus on rand()) and fiddle with the 50% probabilities to get more or fewer levels or attributes.

Hope this helps
 
Thanks for your answer! All I need is a whole bunch of XML files gathered from the web, so what I want is the first option you mentioned. Do you happen to know which web crawler can do that for me? I really appreciate any hint or answer!

Aeroman
 
If you have a Unix or Linux system, you could also look into 'wget' or 'curl', which are command-line utilities for downloading web files. These utilities can masquerade as a browser, even letting you set a particular browser ID (such as IE or Netscape) as well as referrers, authentication, form posting, etc...

Using one of these, possibly together with a bash or Perl script, gives you much more control over your downloads than a web crawler would.
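
For example, a recursive wget crawl that keeps only files ending in .xml might look like this (a minimal sketch, assuming GNU wget; the URL is just a placeholder for whatever site you crawl):

wget -r -l 3 -np -w 1 -A xml \
     -U "Mozilla/4.0 (compatible; MSIE 5.5)" \
     http://www.example.com/

Here -r crawls recursively, -l 3 limits the depth, -np keeps wget from wandering above the starting directory, -w 1 waits a second between requests (polite to the server), -A xml discards everything that isn't an .xml file, and -U sets the browser ID the server sees.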
 
