Extract text from txt file

Status
Not open for further replies.

hugh999

MIS
Nov 29, 2001
129
IE
Hi

I have a txt file that contains a few hundred lines of text. Most of these lines contain a file name, and I want to extract the file names from the txt file to a new txt file.

Would the best solution be to search for the text before the file names (Target 1) and the text after the file names (Target 2), and extract the text (the file names) between Target 1 and Target 2?

The text before the file names is:
(filename="
The text after the filenames is:
")

I would appreciate help on achieving this or an example of code.

Thanks
 
There are two possible ways of doing this:

1) Open the file and read each line (into a String), use IndexOf to find the '(filename="' and '")' parts, then extract the path and append it to the new file...

2) Read the whole file into a String and use Regex to find the paths for you, then save them to the new file...

For simplicity, I would go with the first one.

I'll try to do an example for you,
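
The first approach above can be sketched roughly as follows. This is only an illustration, not the code the poster went on to write: the module name `ExtractWithIndexOf` and the helper `ExtractFileName` are made up for this example, and it assumes each line contains at most one file name.

```vbnet
Imports System

Module ExtractWithIndexOf
    ' Hypothetical helper: returns the text between (filename=" and ")
    ' on a line, or Nothing if either marker is missing.
    Function ExtractFileName(line As String) As String
        Const prefix As String = "(filename="""
        Const suffix As String = """)"

        ' Find the marker in front of the file name.
        Dim startPos As Integer = line.IndexOf(prefix, StringComparison.Ordinal)
        If startPos < 0 Then Return Nothing

        ' Skip past the marker, then look for the closing marker after it.
        startPos += prefix.Length
        Dim endPos As Integer = line.IndexOf(suffix, startPos, StringComparison.Ordinal)
        If endPos < 0 Then Return Nothing

        Return line.Substring(startPos, endPos - startPos)
    End Function

    Sub Main()
        ' Prints FILENAME1.TXT
        Console.WriteLine(ExtractFileName("scrap text (filename=""FILENAME1.TXT"")"))
        ' Prints True, because the line has no file name
        Console.WriteLine(ExtractFileName("no file name here") Is Nothing)
    End Sub
End Module
```

Returning Nothing for lines without a file name lets the calling loop skip those lines instead of writing an empty entry to the new file.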
 
Can you please provide a piece of code showing how this would work?

Thanks
 
I think that you will find that you have the answers to this question already.

Refer back to the questions that you asked in March and November 2004.

[vampire][bat]
 
Thanks for the reminder. In the question I posted in 2004, I knew the exact text I was searching for. Unfortunately, this text file contains file names that all differ and have different extensions. The only way of locating the file names is that the text/characters in front of and behind them are always the same.
 
I guess the best way to do this is to apply a compiled regular expression to each line. If you apply a regex to the whole file it may run into memory issues and be a little slow. Something like this should do it:

Code:
using System.IO;
using System.Text.RegularExpressions;

// ...

            StreamReader reader = null;
            StreamWriter writer = null;
            Regex findFileRegex = new Regex(@"(?<=\(filename="").*?(?=""\))", RegexOptions.Compiled);

            try
            {
                reader = new StreamReader(@"c:\sourceFile.txt");
                writer = new StreamWriter(@"c:\destFile.txt");

                while (!reader.EndOfStream)
                {
                    string line = reader.ReadLine();
                    Match fileMatch = findFileRegex.Match(line);
                    if (fileMatch != null)
                    {
                        writer.WriteLine(fileMatch.ToString());
                    }
                }
            }
            finally
            {
                if (reader != null) reader.Dispose();
                if (writer != null) writer.Dispose();
            }
 
Thanks for the regular expression and code. I converted the code to VB.Net and got it working. The only problem I have is that when the code extracts the file names to the target txt file, it leaves blank lines between the file names; these blank lines correspond to the positions of the file names in the source txt file.

Here is my code

Dim reader As StreamReader = Nothing
Dim writer As StreamWriter = Nothing
Dim findFileRegex As Regex = New Regex("(?<=\( filename="").*?(?=""\))", RegexOptions.Compiled)


reader = New StreamReader("c:\check\AA_test.txt")
writer = New StreamWriter("c:\check\output.txt")

While Not reader.Read

Dim line As String = reader.ReadLine()
Dim fileMatch As Match = findFileRegex.Match(line)

If Not fileMatch Is Nothing Then
writer.WriteLine(fileMatch.ToString())
End If
End While

reader.Close()
writer.Close()
 
I forgot I was in the VB forum. The VB.Net version of the code is this:

Code:
Imports System.IO
Imports System.Text.RegularExpressions

'...

        Dim reader As StreamReader = Nothing
        Dim writer As StreamWriter = Nothing
        Dim findFileRegex As Regex = New Regex("(?<=\(filename="").*?(?=""\))", RegexOptions.Compiled)

        Try
            reader = New StreamReader("c:\sourceFile.txt")
            writer = New StreamWriter("c:\destFile.txt")

            While Not reader.EndOfStream

                Dim line As String = reader.ReadLine()
                Dim fileMatch As Match = findFileRegex.Match(line)

                If Not fileMatch Is Nothing Then
                    writer.WriteLine(fileMatch.ToString())
                End If
            End While

            
        Finally
            If Not reader Is Nothing Then reader.Dispose()
            If Not writer Is Nothing Then writer.Dispose()
        End Try

In your translation, there seems to be an extra space in the regular expression before filename; that would stop it from matching anything at all. I think the main issue is that you are using reader.Read and not reader.EndOfStream. Calling reader.Read consumes a character, so you lose the first character of each line.

The test source file I'm using looks like this:

Code:
scrap text (filename="FILENAME1.TXT")
scrap text (filename="FILENAME2.TXT")
scrap text (filename="FILENAME3.TXT")
scrap text (filename="FILENAME4.TXT")
scrap text (filename="FILENAME5.TXT")
scrap text (filename="FILENAME6.TXT")

What does your test file look like? The result file I get looks like this:

Code:
FILENAME1.TXT
FILENAME2.TXT
FILENAME3.TXT
FILENAME4.TXT
FILENAME5.TXT
FILENAME6.TXT

Were you expecting something different?
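
To illustrate the point about reader.Read made earlier in this post, here is a small sketch. It uses a StringReader as a stand-in for the StreamReader (both derive from TextReader) so that it runs without a file on disk; the module name is made up for this example.

```vbnet
Imports System
Imports System.IO

Module ReadVersusReadLine
    Sub Main()
        Using reader As New StringReader("scrap text (filename=""FILENAME1.TXT"")")
            ' Read() consumes one character and returns its code (or -1 at the end).
            Dim firstChar As Integer = reader.Read()
            Console.WriteLine(ChrW(firstChar))    ' prints the "s" that was consumed

            ' ReadLine() now starts from the second character.
            Console.WriteLine(reader.ReadLine())  ' prints: crap text (filename="FILENAME1.TXT")
        End Using
    End Sub
End Module
```

EndOfStream only reports whether there is more data to read and does not consume any of it, which is why it is the safer loop condition here.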


 
Thanks for the conversion. I upgraded to VS.NET 2005 and the code works. I still have the problem with the blank lines in the target text file.

If a file name is listed in every line of the source text file, then the file names get extracted to the target text file with no blank lines. In my source file, however, the file names are not listed in every line of text.

Example:
scrap text (filename="FILENAME1.TXT")
This is a line of text with no file name
scrap text (filename="FILENAME2.TXT")
scrap text (filename="FILENAME3.TXT")
This is a line of text with no file name
This is a line of text with no file name
scrap text (filename="FILENAME4.TXT")
This is a line of text with no file name
scrap text (filename="FILENAME5.TXT")
scrap text (filename="FILENAME6.TXT")

The end result is a target text file with blank lines between the file names, matching where the file names are in the source file.
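
The blank lines are consistent with how Regex.Match behaves on a line that has no file name: it still returns a Match object, just an unsuccessful, empty one. A minimal sketch, using the regular expression from this thread and one of the sample lines above (the module name is made up for this example):

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module EmptyMatchDemo
    Sub Main()
        Dim findFileRegex As New Regex("(?<=\(filename="").*?(?=""\))")
        Dim fileMatch As Match = findFileRegex.Match("This is a line of text with no file name")

        Console.WriteLine(fileMatch Is Nothing)         ' False: a Match object is always returned
        Console.WriteLine(fileMatch.Success)            ' False: nothing was found
        Console.WriteLine(fileMatch.ToString().Length)  ' 0: writing this produces a blank line
    End Sub
End Module
```

So a check against Nothing always passes, and every line without a file name contributes an empty line to the output.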
 
Changing the IF line to the following should solve the problem:

Code:
If Not fileMatch Is Nothing AndAlso fileMatch.Success Then
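
For reference, here is one way the whole routine might look with that check in place. It is a sketch rather than the exact code anyone in the thread ran: the file paths are the placeholder paths used earlier in the thread, the module name is made up, and Using blocks replace the Try/Finally purely as a matter of style. Since Regex.Match does not return Nothing, the sketch tests only Success.

```vbnet
Imports System.IO
Imports System.Text.RegularExpressions

Module ExtractFileNames
    Sub Main()
        Dim findFileRegex As New Regex("(?<=\(filename="").*?(?=""\))", RegexOptions.Compiled)

        ' Placeholder paths, as in the earlier posts.
        Using reader As New StreamReader("c:\sourceFile.txt")
            Using writer As New StreamWriter("c:\destFile.txt")
                While Not reader.EndOfStream
                    Dim line As String = reader.ReadLine()
                    Dim fileMatch As Match = findFileRegex.Match(line)

                    ' Only write lines where a file name was actually found.
                    If fileMatch.Success Then
                        writer.WriteLine(fileMatch.Value)
                    End If
                End While
            End Using
        End Using
    End Sub
End Module
```

The Using blocks dispose the reader and writer even if an exception is thrown, which is the same guarantee the Finally block gave.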
 
Thanks Aptitude, it works great now. I appreciate all your help with this.
Thanks
 