Parsing a Tab-Delimited text file...

Status
Not open for further replies.

JCruz063

Programmer
Feb 21, 2003
716
US
Hello all,

What I'm dealing with...
I have a text file that is delimited by tabs (\t characters). The file could be viewed as having rows and columns. The rows are delimited by carriage returns/line feeds (\r\n characters), while the columns are delimited by one or more tabs (\t characters). The very first row makes up the heading of the file.

What I need to do...
I need to parse the file and extract the information it contains. For each row, I need to be able to identify each column because each column has a special meaning. Duh!

The problem I'm having...
The number of tabs (\t characters) that appear between each column varies from row to row. It appears that the file was created so that the columns line up visually and thus, the number of tabs between each column varies according to how long the text in each column is. Certain columns may be blank, which causes the number of tabs to be greater for the rows where such columns are blank. The text in some columns varies in length from row to row, and when that's the case, the number of tabs between each column is, again, different.
For example, take a look at this:
Code:
Column1		Col2	ThisIsTheThirdAndLargestColumn3	Column4
Row1Col1	r1c2	row1col3				r1c4
r2c1			AsYouMayKnowThisIsTheSecondRowCol3
Ok - the columns don't quite line up here and I guess that's because I have a different font. It's funny, though, that with the file I have, I could select any font, and the columns will still line up. In any case, the point is this: the number of \t characters between columns differs from row to row. In fact, in my example above, there are 4 \t characters between the data at the third column and the data at the 4th column in the second row (that is, the first non-heading row). There's only 1 \t character, however, between the same columns on the third (and last) row. This little detail, combined with the fact that the number of characters of the data in each column differs for every row, makes identifying the columns a difficult task (at least for me).

The question...
Thus, the question is... How can I identify each column in each row, given the fact that the number of tabs between columns varies for each row?

Thanks!

JC

_________________________________________________
To get the best response to a question, read faq222-2244.
 
I'm not sure you can. You don't know if the excess tabs were put there for visual reasons (make the columns line up) or to skip over a column for which there was no data.
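
To illustrate the ambiguity, here is a minimal C# sketch. The class name and the sample row are made up for illustration (the row mirrors the last row of the example above). Splitting on every tab yields empty fields that could be either blank columns or padding, while collapsing runs of tabs removes the padding but also drops any genuinely blank column.

```csharp
using System;

static class TabAmbiguityDemo
{
    // Split on every single tab: each extra tab produces an empty field.
    public static string[] SplitKeepingEmpties(string row)
    {
        return row.Split('\t');
    }

    // Treat any run of tabs as one delimiter: empty fields disappear.
    public static string[] SplitCollapsingRuns(string row)
    {
        return row.Split(new[] { '\t' }, StringSplitOptions.RemoveEmptyEntries);
    }

    static void Main()
    {
        // Hypothetical row: is "r2c1" followed by blank columns, or by padding?
        string row = "r2c1\t\t\tAsYouMayKnowThisIsTheSecondRowCol3";

        // 4 fields, two of them empty
        Console.WriteLine(SplitKeepingEmpties(row).Length);
        // 2 fields, so the column positions are lost
        Console.WriteLine(SplitCollapsingRuns(row).Length);
    }
}
```

Neither result says which column the long value belongs to; that needs extra knowledge about how the file was produced.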

Chip H.


____________________________________________________________________
If you want to get the best response to a question, please read FAQ222-2244 first
 
Thanks Chip,
chiph said:
You don't know if the excess tabs were put there for visual reasons (make the columns line up) or to skip over a column for which there was no data.
At first, I thought it was to skip over the columns with no data. However, I counted the number of tabs and there are more tabs than there should be. What comes to mind is that some column headings are wider than the space of a tab and, when there is no data in columns, two or more tabs (as opposed to 1) are added.

Now, the file is generated by a program, and I have to assume that the program follows some sort of pattern when creating the file. I counted how many tabs there are between columns (assuming that all columns are blank) and wrote these down - I found the max # of tabs was 3. I think that the program that's generating the file calculates the space taken up by the data in each column, and then adds however many tabs are left after the data. Let me explain with a little example:

Code:
Col1      Col2      ColColColCol3      Col4
[\t]      [\t]      [\t][\t][\t]       [\t]
data[\t]  data[\t]  data[\t][\t]       data
data[\t]  data[\t]  datadata[\t]       data
In the example above, I'm trying to say that to go from Col1 to Col2, it takes 1 tab. From Col2 To ColColColCol3, it takes 1 tab. From ColColColCol3 to Col4, it takes 3 tabs. Thus, if none of the columns have data except the last one, then there should be 5 tabs before the data on the last column. If there is data on the third column, and that data takes no more than 1 tab, then it should take 2 tabs to get to the data at column 4. When the data in column 3 is longer than the space of 1 tab, it should take only 1 tab to reach the data on Col4.

I hope all of that made sense. If it didn't, I'll try to rephrase it in a more understandable way. If it makes sense though, the question is this: How can I know how many tabs a certain text occupies? If I can find this and my assumptions are correct, I'll be able to identify each column. If I'm thinking the wrong way, can you please point me in the right direction?

Thanks again!

JC


 
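Knowing that a tab is 8 characters wide, you could find how many tabs a piece of text takes with this:

string s = "abcdefghij";
// Divide by 8.0 so the result is not truncated by integer division before rounding up.
double tabsOccupied = Math.Ceiling(s.Length / 8.0);

which should return 2.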
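
A related calculation is how many tabs it takes to get from the end of some text to the start of the next column. The sketch below assumes tab stops every 8 characters and a target column that sits on a tab stop; `TabStops` and `TabsToReach` are hypothetical names. It models how an editor renders tabs, which is not necessarily how the generating program pads its output.

```csharp
using System;

static class TabStops
{
    // Number of tab characters needed to move from column `position`
    // to column `target` (both zero-based), assuming a tab stop every
    // `tabWidth` characters and a `target` that is itself a tab stop.
    public static int TabsToReach(int position, int target, int tabWidth = 8)
    {
        if (target % tabWidth != 0)
            throw new ArgumentException("target must be a multiple of tabWidth");
        if (position >= target)
            return 0; // already at or past the target column

        // Each tab jumps to the next stop, so count the stops
        // that lie after `position`, up to and including `target`.
        return target / tabWidth - position / tabWidth;
    }

    static void Main()
    {
        // Next column starts 24 characters (three tab stops) in:
        Console.WriteLine(TabsToReach(0, 24));   // 3: empty cell
        Console.WriteLine(TabsToReach(4, 24));   // 3: a 4-character value
        Console.WriteLine(TabsToReach(8, 24));   // 2: an 8-character value
        Console.WriteLine(TabsToReach(10, 24));  // 2: "abcdefghij"
    }
}
```

Under this model a short value needs as many tabs as an empty cell, so counting tabs alone cannot tell a blank column from a short one.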

--------------------------
"two wrongs don't make a right, but three lefts do" - the unknown sage
 
In your example, the first title row doesn't have any tabs, and your column titles don't have any spaces in them. Is this always the case?

One possibility is to read the first line from the file, and expand any tabs to n spaces. Analyse the resulting string to find the first position of each heading, and chop the rows up into columns using substring based on this. (The end point of the last column will need special treatment, as the data rows might be longer than the title).

This gets over the problem of titles perhaps being shorter than data in the same column, but you might come unstuck if the column titles have spaces in them, as this will cause parsing problems.
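
The approach described above could be sketched roughly as follows in C#. All names here are hypothetical. The sketch assumes 8-character tab stops, headings without spaces, and data that never runs past the start of the next heading.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class FixedWidthByHeader
{
    // Replace each tab with enough spaces to reach the next tab stop.
    public static string ExpandTabs(string line, int tabWidth = 8)
    {
        var sb = new StringBuilder();
        foreach (char c in line)
        {
            if (c == '\t')
                sb.Append(' ', tabWidth - (sb.Length % tabWidth));
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Start index of every heading: a non-space character at position 0
    // or directly after a space. Assumes headings contain no spaces.
    public static List<int> FindColumnStarts(string expandedHeader)
    {
        var starts = new List<int>();
        for (int i = 0; i < expandedHeader.Length; i++)
        {
            bool isText = expandedHeader[i] != ' ';
            bool afterGap = i == 0 || expandedHeader[i - 1] == ' ';
            if (isText && afterGap) starts.Add(i);
        }
        return starts;
    }

    // Cut an expanded row at the heading positions. The last column runs
    // to the end of the row; rows that end early yield empty strings.
    public static string[] Slice(string expandedRow, List<int> starts)
    {
        var fields = new string[starts.Count];
        for (int i = 0; i < starts.Count; i++)
        {
            int begin = starts[i];
            int end = i + 1 < starts.Count ? starts[i + 1] : expandedRow.Length;
            end = Math.Min(end, expandedRow.Length);
            fields[i] = begin < end
                ? expandedRow.Substring(begin, end - begin).Trim()
                : "";
        }
        return fields;
    }

    static void Main()
    {
        // Made-up sample modelled on the Col1..Col4 example in this thread.
        string header = "Col1\tCol2\tColColColCol3\t\tCol4";
        string row = "data\t\tdatadata\t\tdata";

        List<int> starts = FindColumnStarts(ExpandTabs(header));
        string[] fields = Slice(ExpandTabs(row), starts);
        Console.WriteLine(string.Join("|", fields)); // data||datadata|data
    }
}
```

If a value is wider than its column, the text spills into the next slice, so this only works when the headings really mark the column boundaries.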

Is there any chance of producing the data in another format that is easier to process? Like comma-separated, for instance?
 
Thanks for your replies Guys!

I hope no one gets upset at me but the text file does have a pattern. It was difficult for me to detect at the beginning because (1) there are many columns, (2) in most rows, many of the columns are blank, and (3) the data in the columns is codified and abbreviated, and thus, it's not readable (I mean it's English - but it's not understandable).

I figured out that from the first column to the second, there are always 2 tabs, regardless of how much data is in the columns. Subsequent columns have only 1 tab between them, again, regardless of how much data the columns have. Thus, all this time I had a simple tab-delimited file whose parsing code takes minutes to write. Had I tried the parsing code before looking at the file, I would not have gone through all this trouble.
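
For anyone finding this later, the parsing could look something like the C# sketch below. It assumes exactly the pattern described above (2 tabs after the first column, 1 tab between the rest, rows ending in \r\n); the names and the sample text are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

static class SimpleTabParser
{
    // Split one row on tabs and drop the always-empty field produced by
    // the doubled tab between the first and second columns.
    public static List<string> ParseRow(string row)
    {
        var fields = new List<string>(row.Split('\t'));
        if (fields.Count > 1 && fields[1].Length == 0)
            fields.RemoveAt(1);
        return fields;
    }

    // Split the file contents into rows on \r\n and parse each row.
    public static List<List<string>> ParseText(string text)
    {
        var rows = new List<List<string>>();
        string[] lines = text.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
            rows.Add(ParseRow(line));
        return rows;
    }

    static void Main()
    {
        // Hypothetical file contents: heading row plus one data row
        // whose third column is blank.
        string text = "Column1\t\tCol2\tCol3\tCol4\r\n" +
                      "r1c1\t\tr1c2\t\tr1c4\r\n";

        foreach (List<string> row in ParseText(text))
            Console.WriteLine(string.Join("|", row));
        // Column1|Col2|Col3|Col4
        // r1c1|r1c2||r1c4
    }
}
```

Blank columns come back as empty strings, so every row keeps the same number of fields as the heading row.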

I appreciate your responses guys. Thanks again!

JC
