C++ tokenizer

Status
Not open for further replies.

naya

Programmer
Apr 8, 2003
3
US

I really need your help, guys, because I am new to C++ and there is a lot I don't understand yet.
I want to write a program that reads text files, tokenizes the text, and prints the word frequency (how many times each token occurs in a file). I have found some similar programs, but I don't understand much of them. I want something more simple!!
Could you tell me where I can find something like that which is simple enough for a beginner?

Thank you in advance!
 
>I want something more simple!!
The problem with this program is that it will either be very long, using simple concepts, which will make it seem less simple, or very short, using higher-level concepts, which will also make it seem less simple.

Instead of building the entire program from scratch, you should build it incrementally so you understand each piece. Then when you put it together, it's not that complicated. For example, write a program that opens a file and plays around with what it reads, then write a program that takes a sentence as input and tokenizes it, writing each token on a separate line. And finally, write a program that does a simple frequency check, such as how many times each character occurs in a given string.
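As a rough illustration of two of those small pieces, here is a minimal sketch that assumes tokens are simply separated by whitespace. The function names tokenize and char_frequency are made up for this example, and punctuation handling is deliberately left out.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Split a sentence into whitespace-separated tokens.
// Assumption: punctuation stays attached to the word it touches.
std::vector<std::string> tokenize(const std::string& sentence)
{
    std::istringstream in(sentence);
    std::vector<std::string> tokens;
    std::string word;
    while (in >> word)          // operator>> skips any run of whitespace
        tokens.push_back(word);
    return tokens;
}

// Count how many times each character occurs in a string.
std::map<char, int> char_frequency(const std::string& text)
{
    std::map<char, int> counts;
    for (char c : text)
        ++counts[c];            // operator[] creates a missing entry with value 0
    return counts;
}
```

Printing each element of the vector returned by tokenize on its own line gives the "one token per line" program, and the same map idea carries over from characters to whole words.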

Once you have all of those programs, putting them together into the one that you want will be easy.
 
Akarui... THANK YOU!!! Finally someone who understands the concept of simplification in problem solving!!! thread207-514778

[cheers]

I'm considering retiring my Tek-Tips handle now!

-pete


 
Another approach: Dig into flex and bison, which are tools for generating parsers (and compilers actually) in C.

It looks (and is) quite complex at first, but IMHO it is worth it in the long run. I used this site as my "tutor":

1) It shows some interesting techniques.
2) Once you get the hang of it you can quite easily make a parser that can parse/tokenize _anything_ (C++, Visual Basic, LPC, YourOwnInventedLanguage, ...).
 
Thank you very much for your advice. I have already started the program, but now I am stuck.

I have this function, which counts the frequency of each word in a file ("out.txt"). This file holds the tokens, one per line. The function is called from main() whenever I want to count the word occurrences in a file. When I call it again (after "out.txt" has changed) to count the word frequencies of another file, the file "1.txt" ends up with the total frequency of the words, including the tokens from the previous call. What am I doing wrong? Is something happening with the iterator or the const? I really don't know much about these things. Here are the functions:

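#include <algorithm>
#include <fstream>
#include <iterator>
#include <map>
#include <string>
using namespace std;

// Global word counts, shared by every call to freq().
map<string,int> histogram;

// Add one occurrence of the token s.
void record(const string& s)
{
    histogram[s]++;
}

// Append one "word count" line to eksodos.txt.
void print(pair<const string,int> & r)
{
    ofstream fout("eksodos.txt",ios::app);
    fout << r.first << ' ' << r.second << '\n';
}

void freq()
{
    // Read the whitespace-separated tokens from out.txt and count them.
    ifstream fin("out.txt");
    istream_iterator<string> ii(fin);
    istream_iterator<string> eos;
    for_each(ii,eos,record);
    for_each(histogram.begin(),histogram.end(),print);

    // Copy the results to 1.txt, then empty eksodos.txt for the next call.
    ifstream from("eksodos.txt");
    ofstream to("1.txt");
    char ch;
    while(from.get(ch)) to.put(ch);
    ofstream fout("eksodos.txt",ios::trunc);
}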

Thank you very much in advance
 