Downloading Text from websites 1

Status
Not open for further replies.

DEDMOD

Programmer
Feb 1, 2001
721
US
Since the previous thread was getting a bit long I've decided to start a new one.

OK, so Chris, let's assume that your approach is best: to save the entire thread as a memo. I did a quick test, and the text files of two Tek-Tips pages, one large and one small, were 98k and 50k respectively. So if threads average 75k each and there are 6000 threads, the database would be 450 meg or more. I guess this is feasible, but it is rather large. Anyway, the next question is what's the best way to pull down each thread in order. I suppose I need to look at what's behind the 'next thread' button.

Dave Dardinger
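The size estimate above can be checked with a few lines of arithmetic. This is only a sketch in Python (not VFP); the function name is made up for illustration, and it treats 1 meg as 1000k, which is an assumption.

```python
def estimated_archive_mb(avg_kb_per_thread, thread_count):
    """Rough archive size in meg, assuming 1 meg = 1000k."""
    return avg_kb_per_thread * thread_count / 1000

# The two sample pages from the quick test: 98k and 50k.
sample_average_kb = (98 + 50) / 2

# Size with the rounded 75k average and 6000 threads.
rounded_estimate = estimated_archive_mb(75, 6000)

# Size with the exact average of the two samples.
sample_estimate = estimated_archive_mb(sample_average_kb, 6000)
```

With only two sample pages the average is a guess, so the real total could differ a good deal in either direction.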
 
Dave

"to save the entire thread as a memo."

I was suggesting saving the thread as a web page.

Typo or misunderstanding?

HTH

Chris [pc2]
 
Hi Dave,

re: .....I suppose I need to look at what's behind the 'next thread' button.

A good idea! I just tried it:

While the current page with your new thread is (at the moment) at the Internet address


every 10 threads will lead to the following addresses

and so on...that means only the last digit changes for the page no.

Of course I tried to find out the (last (or first)) page of the tek-tips base and came to:


(Chris's calculation was very close: 596*10 = 5960 threads!)

and was surprised that on this page there were only three threads:

Name Date Responses Title
rgordley (Programmer) 5/11/99 (posted 4/30) 1 responses printing table definitions
MelanieVA (Programmer) 4/29/99 (posted 3/31) 2 responses CLOSING SQL TABLES
Bill (Visitor) 8/25/98 (posted 8/25) 0 responses Visual Foxpro and Winframe

On the next page:

you will get a page with no threads but only the comment:
" Start A New Thread or Click on Subject Lines to View Threads."


When I use the last/first page and want to see the details of a thread, then
e.g. these two detailed threads are visible:
-----------------------------------------------------------------------------------------------------------------------------------
PRINTING TABLE DEFINITIONS
thread184-1355
rgordley (Programmer)
Apr 30, 1999

Is there any way you .....

and:

CLOSING SQL TABLES
thread184-1351
MelanieVA (Programmer)
Mar 31, 1999
I am trying to delete ....etc


That means they must have deleted everything before a certain thread (otherwise they should have started with thread 184-0001).

Anyway ...there is a system visible
a) every "next 10 thread-step" the page address changes by a new page no. at the end
b) the oldest page has a message beginning " Start A New Thread or Click on Subject Lines to View Threads."


I am not expert enough to write a program that could simulate the mouse clicks needed to call up Internet pages. However, if somebody were in a position to do that (e.g. by working from a self-created VFP table with all the addresses defined in advance, and then doing all the downloads via a shell), that would be a glorious program, as one would only have to free up about 450 meg of hard disk and then sleep for one or two nights....*ggg

After this we would have a wonderful file with the know-how and questions of about 6,900 people..more or less everything that has ever been asked is documented there...*ggg ...and clicking on the predefined table again could make it possible to update the records one is interested in...
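The scheme in a) and b) above (step through the page numbers, stop at the page that lists no threads) could be sketched roughly as follows. This is Python rather than VFP, and several things are assumptions: `fetch_page` is a hypothetical downloader supplied by the caller, and the idea that listing pages contain references of the form "thread184-1355" is a guess based on the thread details quoted above.

```python
import re
from typing import Callable, Iterator, List, Tuple

# Assumed form of a thread reference in forum 184, e.g. "thread184-1355".
THREAD_REF = re.compile(r"thread184-(\d+)")

def crawl_listing(fetch_page: Callable[[int], str],
                  max_pages: int = 1000) -> Iterator[Tuple[int, List[int]]]:
    """Yield (page_no, thread_ids) until a page lists no threads."""
    for page_no in range(1, max_pages + 1):
        html = fetch_page(page_no)      # hypothetical: returns the page text
        thread_ids = [int(n) for n in THREAD_REF.findall(html)]
        if not thread_ids:              # empty page: we are past the oldest threads
            break
        yield page_no, thread_ids
```

Passing the downloader in as a function keeps the page-stepping logic separate from the actual Internet access, so it can be tried out on a few saved pages first. The `max_pages` limit is only a safety net against an endless loop.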

Regards
Klaus
 
Chris,

I'm not sure of the difference. What I did was click View / (page) source and then copy that into a file and thence into a memo field. Is there another way to do it which stores the data in a shorter format? I rather doubt it, since checking this thread in Windows Explorer shows it to be 57k. I suppose you could store the web pages separately with just a reference to the page, but all that does is clutter up the hard drive with 6000 files it didn't have before. I don't see the advantage.

Dave Dardinger
 
Klaus,

The code in the old thread itself is/was:

<a href="viewthread.cfm?sqid=242759&spid=184&page=2&nx=1 &CFID=57986585&CFTOKEN=69910287" STYLE="color:red;">

which I suppose amounts to the same thing. Except that you shouldn't have to switch pages every 10, 15, or 20 threads.

Anyway, when you click the "next thread" link on thread 255481 it gives:

thread184-255481 &page=6&nx=1&CFID=58103915&CFTOKEN=29672829

in the IE address box and 255256 is the new thread displayed. Unfortunately I'm not that up on Java and don't know where to find the definition of 'viewthread.cfm'. So I don't know what to put into the 'navigate' code which would pull up the proper thread.

Actually I just did a couple of tests. You can change the sqid number and it will pull up the corresponding thread. However, apparently the threads are numbered consecutively regardless of forum, and Tek-Tips doesn't care. It will pull up the body of the thread and surround it with the surroundings indicated by the spid. Thus when I incremented a thread id it gave me a thread talking about CPUs and Sun Microsystems but listed the thread at top as a 184 thread and gave our list of experts, et al.

Dave Dardinger
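To make the sqid/spid observation concrete, here is a small Python sketch (not VFP) that builds and takes apart a link of the kind quoted above. The function names are invented for illustration, and the link is kept relative because the site address is not given in this thread; the session parameters (CFID, CFTOKEN) are left out on the assumption that they are not needed to identify a thread.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def thread_link(sqid, spid=184):
    """Build a relative viewthread.cfm link for a thread id and forum id."""
    return "viewthread.cfm?" + urlencode({"sqid": sqid, "spid": spid})

def thread_ids(href):
    """Return (sqid, spid) from a viewthread.cfm link."""
    query = parse_qs(urlsplit(href).query)
    return int(query["sqid"][0]), int(query["spid"][0])

# Incrementing the sqid gives the link for the next consecutively numbered thread.
sqid, spid = thread_ids("viewthread.cfm?sqid=242759&spid=184&page=2&nx=1")
next_link = thread_link(sqid + 1, spid)
```

Since the numbering is shared across forums, a thread fetched this way may belong to a different forum than the spid suggests, so the result would still need checking.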
 
Dave

What I have been suggesting is save the thread as a web page (.htm file)

The keystrokes, in rapid succession, are as follows:-

ALT+F
ALT+A
ALT+S
ALT+Y (If overwrite of existing file required)

Why is it so important to save the thread to another file format, and what benefits are derived from the additional complexity? HTH

Chris [pc2]
 
It's not a case of saving to another format. .htm is just a text file, after all. I'm just thinking that it's easier to search within a table than to have to open any of thousands of separate files. => Added later: And thousands of folders filled with duplicate .gif files, etc. It's got to be better to save the files as memo fields in one table. Admittedly to get things to display right you'll have to add all the gifs and other things into some resource directory, but that shouldn't be hard to do.

Dave Dardinger
 
Dave

Clearly we are at cross purposes.

"I'm just thinking that it's easier to search within a table than to have to open any of thousands of separate files"

You would be searching only within the cursor populated by ADIR(), the fields being CURSORNAME.filename and FILETOSTR(CURSORNAME.filename).

So if there were 6,000 records, a single SCAN...ENDS, or SELECT... is all that's required to perform as complex a search as required.
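The same idea (list the saved files, read each one as a string, count the keyword) can be sketched outside VFP as well. The following is a Python analogue, not the ADIR()/FILETOSTR() code itself; the function name and the assumption that all saved threads are .htm files in a single folder are mine.

```python
from pathlib import Path

def search_saved_threads(folder, keyword):
    """Return (count, filename) pairs, highest keyword count first."""
    needle = keyword.lower()
    hits = []
    for path in Path(folder).glob("*.htm"):
        # Read the whole file as text, much as FILETOSTR() would.
        text = path.read_text(encoding="utf-8", errors="replace").lower()
        count = text.count(needle)
        if count:
            hits.append((count, path.name))
    # Most instances first; ties are ordered by filename.
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return hits
```

Note that this counts matches inside HTML tags too, so a keyword such as "table" would be over-counted unless the tags are stripped first.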

"Admittedly to get things to display right you'll have to add all the gifs and other things into some resource directory"

Save the thread as a web page and all that is taken care of in that the .htm file is saved to the folder as:-

"Downloading TekTips threads - Tek-Tips.htm"

and the images, css, script files etc saved to the folder:-

"Downloading TekTips threads - Tek-Tips_file"

How do you intend to display the result?

If you display it as a text file, you are going to see all the html tags etc.

Display it in Internet Explorer and, subject to being prompted to go online, you will be able to:-

Navigate to another thread if the thread reference is included in the web page.

Navigate to a FAQ if the FAQ reference is included in the web page.

Navigate to a URL if the URL reference is included in the web page.


If these are new web pages to you and relevant, you can download and add them to the "database".

I really cannot see any need for any tables whatsoever for this application - one top-level form, with or without menu/toolbar, without a main.prg, is all that's necessary, IMHO.

If you want to be able to select a folder for the web pages, then you can use a .mem file to store that or any other setup type information. HTH

Chris [pc2]
 
Well, for my tiny test I'd put the contents of the memo field into a temp file and displayed that in IE. Interestingly, while most of the .gifs were absent and the formatting clunky at best, your typing smiley was present since I'd previously downloaded all the smileys and put them somewhere they are accessible. I'm not saying that things will be as automatic as your technique, but I still cringe at the thought of thousands of directories and tens of thousands of files cluttering up the hard drive, slowing down every search I do, etc. If we're going to do it your way, then why do it on our computers at all? The Tek-Tips search engine isn't that bad, is it?

Dave Dardinger
 
I think german's original problem was unreliable access. Perhaps it is a very slow dial-up connection. -Pete
 
Hi Dave, Chris and Pete.

I followed this very interesting discussion, which has nothing to do with whether my connection was too slow. OK, sometimes it was. Unfortunately I cannot compare the speed with yours; however, there were days when I had to wait more than half a minute until I saw the first thread, and when one wants several pieces of information on the same theme, that costs time, and money too, even here.

So far you are perhaps right, Pete, but there is no quicker way to find something than when you have it on your own hard disk, when it is automatically updated there, and when you can work on additional find routines which ease it further. And as I said, it is not only this database: there are a lot of other interesting ones on the net (e.g. chess) where there are the same problems: huge amounts of material which can best be found and filtered when it is close enough to you.

I do not want to make this line superfluous. On the contrary, all I wanted in the beginning was just a list with all the headlines for a download (not the contents at that point). This list could have been sorted by text, and could have been extended by a self-defined field (you know of the uncertain headlines we discussed, also for my own grouping title). In short, I wanted something which the management refused, and that is what I still do not understand, and that was the start of our dialogues.

At the moment I am doing nearly what Dave had in mind. I only mark interesting text modules, and with a mouse click I send them all into one big file as a memo together with the datetime and a headline (which is automatically the thread title). So I now have a database of the things I personally want to study more deeply, where I can immediately get code snippets etc.
And it is really fast, has no pics or graphics, and do not forget: this way I also pick up other newsletters about VFP....and can find more interesting things all in one base....

But what Chris said is also very interesting, and a completely new way of gathering data. I am really curious
how that would work. Today hard disks are very cheap, and the best way could be to buy a small one for
this enterprise, and then one could see where the handling is better....

My concern is, how you Chris will name the files in order to see in advance what is in, as you plan no table
which I built - where the first column is the thread title in my base. Hopefully your names are not the thread...*ggg
There must be some ways to group and analyze also this heap of data-files at least they all can be viewed
in the cursor, as how Chris suggested...


OK - I am about to go on vacation - but I hope to see more things like this here - they are very good for thinking
about new ways - and there are always some in programming....

Regards from Germany

Klaus





 
I just found an interesting advertisement for a program called GOLDEN RETRIEVER -
which can transfer - Infos from Internet to VFP, EXCEL and others:

Look at:

Does anyone know this program?

Pricing

Single User License: $99
Developer Licence (royalty free distribution with your own custom program): $249
Webmaster License (free distribution to any of your members for use with forms on your site): Starting at $249 (call for details)

Regards
Klaus
 
Yup. I wrote it. We decided to develop it internally because we couldn't find anything on the market that did anything similar; now we're releasing it for sale.

Its main strength is that it can do all the work of retrieving form results, parsing them and placing them in a text file (in CSV or plain text Name=Value pairs) which can then be processed in a program.
 
Dave

I can see you might have a problem with the number of files and folders if you only have a single drive or partition.

I dedicate a partition to this sort of archive/reference material and normally either exclude or include it in a Windows search according to what is being sought.

Anyone being serious about this would provide suitable accommodation for the data, so any problem a developer would have with that aspect would be of their own making.

Klaus

"My concern is, how you Chris will name the files in order to see in advance what is in"

When you save a thread with Internet Explorer, the thread title becomes the file title, so saving this thread you get:-

Downloading TekTips threads - Tek-Tips.htm

as the filename, and

Downloading TekTips threads - Tek-Tips_file

as the relevant folder.

So the quick way to save is to hold down the ALT key and press F, A, S, Y (If overwrite of existing file required)

You can do this whilst reading a thread to avoid wasting time.

I am sure one could write a TSR in WSH or something similar to be able to save with a single keystroke or key combination.



"Hopefully your names are not the thread..."

"There must be some ways to group and analyze also this heap of data-files at least they all can be viewed"

Why bother?

The filenames are unimportant - it is the keyword search that is going to return the files with the highest number of instances in descending order.

The primary reason for displaying a grid is to show the search results - the secondary reason is to allow you to get an idea of the contents of a file by its title. HTH

Chris [pc2]
 
hmmm........

Q: how do you 'quote' a different forum thread in your posting?

thanks
vlad
+---------------------------+
|#include<disclaimer.h> |
+---------------------------+
 
my 2 cents, I just save the web pages, but put a name to my liking.

There is also the option in the browser to save as text format.

Attitude is Everything
 
HI,
I have developed software in VFP to take care of this. It has limitations.. but as long as only I use it.. I know what I have to do and so no problems. I gave a thought to developing it further and even releasing it as freeware, some day, when my mind agrees to do so. However, I can outline the concept I adopted for now.

I created a form and added the native WebView class provided along with VFP. I have added/changed a few of the functions.. very few lines of code.. Added buttons to go back or forward just as IE does, a URL address box to reach a thread I want.. etc.

I use a table with category/title/address/details ++ a few of my interests as fields.. so that all my web collections can be conveniently kept in a table and searched the way I want using different forms.

The important thing is an add button.. which captures the entire thread content (not just the visible screen) and saves it as a new record.. with input boxes to update table fields title/category etc.

When it comes to Tek-Tips threads, I haven't done it yet.. but plan to identify the thread in the existing database and replace the entire content automatically. As of now, I can overwrite the existing record of a table.. at the current record pointer location.. with the fresh update by a click on an update button.. distinct from the add button. (As I said, I know what to do.. but for a release this has to be a more sturdy approach.)
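For readers without the VFP classes, the table concept described above (category/title/address/details, an add that becomes an update when the thread is already stored, and a search over the details) could be sketched roughly like this. It is Python with SQLite standing in for the VFP table and memo field, so every name here is an assumption for illustration and not ramani's actual design.

```python
import sqlite3

def open_archive(path=":memory:"):
    """Open (or create) the archive; the address identifies a thread."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS threads ("
        "address TEXT PRIMARY KEY, category TEXT, title TEXT, details TEXT)"
    )
    return db

def save_thread(db, address, category, title, details):
    """Add a thread, or replace the stored copy if the address already exists."""
    db.execute(
        "INSERT OR REPLACE INTO threads VALUES (?, ?, ?, ?)",
        (address, category, title, details),
    )
    db.commit()

def find_threads(db, keyword):
    """Return the titles of threads whose details contain the keyword."""
    rows = db.execute(
        "SELECT title FROM threads WHERE details LIKE ? ORDER BY title",
        ("%" + keyword + "%",),
    )
    return [row[0] for row in rows]
```

Using the address as the key is what lets a single "save" act as both the add and the update button; whether that suits a real collection depends on how stable the thread addresses are.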

The reason I cannot release it is that I have to rewrite the base classes.. the current ones being protected for my business reasons. Hope some day I do it. The time is the problem :-(

ramani :)
(Subramanian.G),FoxAcc, ramani_g@yahoo.com
 
Well.......

That's all nice and enlightening, but.... My question is much more basic.

When I post on Tek-Tips forums I'd like to quote another thread. I can see people do it where the quoted/referenced thread appears something like that [underlined]:

Thread822-350053

A reader can click on the referenced thread to read it.

Q: how is't done and what's the TGML construct for that when posting?

thanks
vlad
+---------------------------+
|#include<disclaimer.h> |
+---------------------------+
 
Vlad

Q: how is't done and what's the TGML construct for that when posting?

You just did it. To quote an FAQ, type in "FAQ184-whatever" (no quotes, no spaces) and it will appear as a hyperlink to click on. Same thing for a thread ("thread184-whatever"): no quotes, no spaces. If you are not sure, use the "preview post" before submitting.

Just remember this is Forum #184.

And when you are typing your message (or answer), just at the bottom of the box there is a link to show you the "TGML" tags available.
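For anyone processing saved threads, references written this way are easy to pick out again. The sketch below is Python and only guesses at the pattern from the examples in this thread (the word "thread" or "FAQ", a forum number, a hyphen, an item number, no spaces); it is not the forum's own TGML code.

```python
import re

# Assumed pattern: "thread184-255481" or "FAQ184-1355", in any letter case.
REFERENCE = re.compile(r"\b(thread|faq)(\d+)-(\d+)\b", re.IGNORECASE)

def find_references(text):
    """Return (kind, forum_no, item_no) for each reference found in text."""
    return [
        (kind.lower(), int(forum), int(item))
        for kind, forum, item in REFERENCE.findall(text)
    ]
```

A reference typed with a space in it (e.g. "thread 184-1355") is not matched, which mirrors the "no spaces" rule above.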

Mike Gagnon