
How to get lots of XML files (at least 10,000) in a short period of time?


Aeroman

Programmer
Feb 14, 2001
I'm doing research on processing XML queries against XML documents. First I need to collect a large set of XML documents (at least 10,000) to use as samples, and then write a parser to search for keywords across all of them. Can anyone tell me how to get that many XML documents in a short period of time, instead of downloading them one at a time? I really appreciate anyone's answer and help!

Mike Fang
 
Do you just need XML documents, which you can probably get from a web crawler, or do you actually need to create them for a special purpose?

If the second, you can probably write a crude "random document creator" and just loop it 10,000 times...

Something like (C code):

#include <stdio.h>
#include <stdlib.h>

/* Print one random attribute, e.g. fur="red" */
void randomAttrib(void) {
    const char *wordArray[]  = { "fur", "eyes", "collar", "ears", "tail" };
    const char *wordArray2[] = { "red", "blue", "green", "yellow", "black" };
    int index  = rand() % 5; /* 0-4 */
    int index2 = rand() % 5; /* 0-4 */
    printf(" %s=\"%s\"", wordArray[index], wordArray2[index2]);
}

/* Print one random element, recursing to nest child elements.
   The level < 5 cap is a safety net against runaway recursion. */
void randomElement(int level) {
    const char *wordArray[] = { "cat", "dog", "elephant", "fish", "shark" };
    int index = rand() % 5; /* 0-4 */
    printf("<%s", wordArray[index]);
    while (rand() % 2) {              /* 50% chance of another attribute */
        randomAttrib();
    }
    printf(">\n");
    while (level < 5 && rand() % 2) { /* 50% chance of a child element */
        randomElement(level + 1);
    }
    printf("</%s>\n", wordArray[index]);
}
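
To actually produce the files, you could wrap it in a small driver like this (a minimal sketch; the doc%05d.xml naming scheme is just an illustration):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical driver: writes 10,000 random documents, one per file.
   Assumes randomElement() from the snippet above is in scope. */
int main(void) {
    srand((unsigned) time(NULL));    /* seed once so each run differs */
    for (int i = 0; i < 10000; i++) {
        char name[32];
        sprintf(name, "doc%05d.xml", i);
        freopen(name, "w", stdout);  /* point stdout at the next file */
        randomElement(0);
    }
    return 0;
}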

You can use any number of words (just change the modulus on rand()) and fiddle with the 50% probabilities to get more or fewer levels or attributes.

Hope this helps
 
Thanks for your answer! All I need is a whole bunch of XML files gathered from the web, so what I want is the first option you mentioned. Do you happen to know which web crawler can do that for me? I really appreciate any hint or answer!

Aeroman
 
If you have a Unix or Linux system, you could also look into 'wget' or 'curl', which are command-line utilities for downloading web files. These utilities can masquerade as a browser, even letting you set a particular browser ID (such as IE or Netscape) as well as referrers, authentication, form posting, etc...

Using one of these, possibly together with a bash or Perl script, gives you much more control over your downloads than a web crawler would.
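
For example, a recursive wget crawl that keeps only files ending in .xml might look like this (a minimal sketch, assuming GNU wget; the URL is just a placeholder for whatever site you crawl):

wget -r -l 3 -np -w 1 -A xml \
     -U "Mozilla/4.0 (compatible; MSIE 5.5)" \
     http://www.example.com/

Here -r crawls recursively, -l 3 limits the depth, -np keeps wget from wandering above the starting directory, -w 1 waits a second between requests (polite to the server), -A xml discards everything that isn't an .xml file, and -U sets the browser ID the server sees.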
 
