×
INTELLIGENT WORK FORUMS
FOR COMPUTER PROFESSIONALS

Log In

Come Join Us!

Are you a
Computer / IT professional?
Join Tek-Tips Forums!
  • Talk With Other Members
  • Be Notified Of Responses
    To Your Posts
  • Keyword Search
  • One-Click Access To Your
    Favorite Forums
  • Automated Signatures
    On Your Posts
  • Best Of All, It's Free!
  • Students Click Here

*Tek-Tips's functionality depends on members receiving e-mail. By joining you are opting in to receive e-mail.

Posting Guidelines

Promoting, selling, recruiting, coursework and thesis posting is forbidden.

Students Click Here

Jobs

Parsing Odd Data

Parsing Odd Data

Parsing Odd Data

(OP)
This may be a MySQL query question but I suspect that PHP would be needed instead. In a table with a number of columns, there are two as shown below that I'm having trouble with. The data is from 1930s, 1940s and 1950s automotive parts books so I cannot change it but I need to be able to search on it more efficiently and its formatting is preventing that so I am thinking that either a query that can work through the issues or something else (perhaps the one-time creation of a lookup table) is the way to go.

The problem now is that it is impossible to search on many models because only part of the model number is there. For example, if someone wants to find 2213 nothing here will show up and searching on only 13 or even -13 will bring up thousands of unrelated entries from the table.

The data was formatted only to make it printable (this was OCRed from hard-copy) and it looks straightforward but actually the second example is what it really means:
(These are simplified without extraneous columns but there is a primary key field, among others)

PartNo     Models
_________________________________________________________________
391719     2206-13-26-33; 2302-06-13-32-33

G141111    2631 (BODY 2678); 5431 (BODY 5478)

L422445    2401 (BODIES 2462-65-92-15-98): 2501; 2601-11-33 

Meaning:

PartNo     Models
_________________________________________________________________
391719     2206 2613 2626 2633; 2302 2306 2313 2332 2333

G141111    2631 (BODY 2678); 5431 (BODY 5478)

L422445    2401 (BODIES 2462 2465 2492 2415 2498): 2501; 2601 2611 2633 

I am open to suggestions but what I think I need is to extract the data into something like this but I have no idea how to do so:

PartNo   ModelNo   BodyNo
391719   2206
391719   2213
391719   2226
391719   2233
391719   2302
391719   2306
391719   2313
391719   2332
391719   2333
G141111  2631      2678
G141111  5431      5478
L422445  2401      2462
L422445  2401      2465
L422445  2401      2492
L422445  2401      2415
L422445  2401      2498
L422445  2501      2601
L422445  2501      2611
L422445  2501      2633 

Otherwise some way to make the search engine see 2462-65 (for example) as 2462 2465 would be okay and, in that event, it probably is not necessary to differentiate between model number and body number. There will ultimately be hundreds of thousands of rows of data and there may be other similar oddities that I've not yet come across.

I just spotted another and without going through adding it to the above, the Model column looks like this so clearly I'll need to add a generic model table with the needed numbers:

ALL 22ND; 23RD; 24TH; 25TH; 26TH; 5402-06-13-26-31

RE: Parsing Odd Data

(OP)
Oops! L422445 2401 (BODIES 2462 2465 2492 2415 2498): 2501; 2601 2611 2633 should be:

CODE --> Table

PartNo   ModelNo   BodyNo
391719   2206
391719   2213
391719   2226
391719   2233
391719   2302
391719   2306
391719   2313
391719   2332
391719   2333
G141111  2631      2678
G141111  5431      5478
L422445  2401      2462
L422445  2401      2465
L422445  2401      2492
L422445  2401      2415
L422445  2401      2498
L422445  2601
L422445  2611
L422445  2266 

RE: Parsing Odd Data

Why does this

CODE

2206-13-26-33 
Expand to

CODE

2206 2613 2626 2633 

Ie where is the logic that says transform 22 to 26?

RE: Parsing Odd Data

(OP)
It was a simply typo that I had missed and there may be a couple others but the concept should be clear enough even if my typing is not.

RE: Parsing Odd Data

Oh. I had not made that assumption.

So that series will actually be 2206 2213 2226 2233?

If so then a regex query will allow you to search accurately. And a like query will get you close.

But it looks overall better to restructure the table and do a one time expansion of the textual data into a properly structured format.

RE: Parsing Odd Data

(OP)
Yes, the series will actually be 2206 2213 2226 2233 and my apologies for the confusion. Unfortunately I cannot restructure or modify the existing table or its data so whatever is needed will either have to be done in the programming or by creating a secondary table as described and I have no idea how to begin such a task. If it can be done by a query or even a series of queries, then that's fine but it is beyond my skills to do so!

So far the data is limited to a few thousand test rows so whatever is done would need to be repeatable as it is being populated from the hard copy which itself is extremely time-consuming due to the quality of the originals, headings on each section and subsection, the fact that the pages' layout change slightly from year to year and a few other factors.

RE: Parsing Odd Data

i spent five minutes writing a script for you. one anomaly i cannot solve from your explanation is the significance of 2501 in this

CODE

L422445 2401 (BODIES 2462 2465 2492 2415 2498): 2501 

I suspect that the 2501 will be another modelNum. although I can't see a reason why it is delimited by a comma rather than semi-colon. And in your second explanation you remove all references to 2501.

RE: Parsing Odd Data

(OP)
The 2501 is another model and the clue us that it is separated from the previous model by a semicolon. So for model 2401, it applies only to certain bodies but for 2501 and the others, the part fits all bodies of the listed models.

RE: Parsing Odd Data

(OP)
Yes, my second reference has inadvertently omitted 2501 which should precede 2601 and the last one should be 2633. Sorry, I'm terribly dyslexic but I try!

I don't see anything delimited by a comma but I do see a colon rather than a semicolon which must be an OCR error that I missed.

RE: Parsing Odd Data

i see.

it is going to be difficult to cater for OCR errors as well. but try this and see what output you get

CODE

<?php

$query = <<<SQL
select  partNo, models
from    oldTable
SQL;

$createQuery = <<<SQL
CREATE TABLE IF NOT EXISTS p_m_b (
partNo varchar(20) NOT NULL,
modelNo varchar(8) NOT NULL,
bodyNo varchar (8) NULL
)
SQL;

$indexQuery = <<<SQL
CREATE UNIQUE INDEX IF NOT EXISTS pmb ON p_m_b (partNo, modelNo, bodyNo)
SQL;


mysql_query($createQuery) or die(mysql_error());
mysql_query($indexQuery) or die(mysql_error());

$result = mysql_query($query);
while($row = mysql_fetch_array($result)):
    $partNum = $row[0];
    $chunks = explode(';',$row[1]);  //grab the model data
    $chunks = array_map('trim', $chunks); //trim it
    foreach($chunks as $key=>$val):
        if(strpos($val, '(') === FALSE):
            //no body data
            $prefix = substr($val, 0, 2);
            $bits = explode('-', $val);
            foreach($bits as $_key=>$bit):
                if($_key == 0):
                    addItem($partNum, $bit, null);
                else:
                    addItem($partNum, $prefix . $bit, null);
                endif;
             endforeach;
        else:
            //check for colon
            $bits = explode(':', $val);
            foreach($bits as $bit):
                if(strpos($bit,'(') === false):
                    //no body data
                    $_bits = explode('-', $bi);
                    foreach($_bits as $_key=>$_bit):
                        if($_key == 0):
                            addItem($partNum, $bit, null);
                        else:
                            addItem($partNum, $prefix . $bit, null);
                        endif;
                    endforeach;
                else:
                    //has body data
                    $pattern = "|(\d.*?) \(BOD.*? (.*?)\)|";
                    preg_match_all($pattern, $bit, $matches);
                    foreach($matches[1] as $i=>$j):
                        $modelNum = trim($j);
                        $prefix = substr($matches[2][$i],0,2);
                        foreach(explode('-',$matches[2][$i]) as $p=>$bodyNum):
                            if($p === 0):
                                addItem($partNum, $modelNum, $bodyNum);
                            else:
                                addItem($partNum, $prefix . $modelNum, $bodyNum);
                            endif;
                        endforeach;
                    endforeach;
                endif;
            endforeach;
        endif;
    endforeach;
endwhile;


function addItem($partNum, $modelNum, $bodyNum){
    $insertQuery = <<<SQL
INSERT INTO p_m_b
(partNo, modelNo, bodyNo) 
VALUES
('%s,'%s','%s')
SQL;
    $query = sprintf($insertQuery, $partNum, $modelNum, $bodyNum);
    return mysql_query($query);
} 

RE: Parsing Odd Data

(OP)
Thank you. It's after 3:00 AM now so I'll try it tomorrow but please don't try to cater to OCR errors. Just presume they are not there as I'll run a query and search manually to try to locate things like that. Also, there was a section of un-proofed data and my examples my have come from that area.

RE: Parsing Odd Data

(OP)
I added this to get the connection, which seems to be compatible with your code. The table is being created bo far it is giving an error on the line trying to create the index although I've not yet looked into it to see why. It should be easy to sort out, though. I suspect it will also need a facility to drop the table and recreate it each time it's run but I'll look into that at some later date.

Connected successfully

You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'IF NOT EXISTS pmb ON p_m_b (partNo, modelNo, bodyNo)' at line 1



CODE --> PHP

include("config/config.php");

$conn = mysql_connect($dbhost, $dbusername, $dbpass);

if (!$conn) :
    die("Connection failed: " . mysql_connect_error());
endif;
echo "Connected successfully\n\n<p>";
	
$db_selected = mysql_select_db($dbname, $conn);
if (!$db_selected) :
    die ('Can\'t use $dbname : ' . mysql_error());
endif; 

RE: Parsing Odd Data

(OP)
No more colons instead of semicolons but I did come across another odd bit of data for RHD (Right Hand Drive) export cars that I'm not sure how to handle.

Quote:

RHD, 2401; 2501; 2601-11; 5400-01-11

Perhaps you have an idea although probably simply prepending it to the model listing might be okay:

Quote:

RHD 2401
RHD 2501
RHD 2601
RHD 2611
RHD 5400
RHD 5401
RHD 5411

RE: Parsing Odd Data

(OP)
I changed the table name and modified the code a bit so that it creates an auto-incrementing key rather than trying to add an index later. It is creating the table properly but it is not populating it so I added some echos to see how it parsed the data. Apparently the addItem() function is failing but I see nothing wrong unless it needs the connection passed into it.

The lines on-screen also help me to know when it has completed the operation. It's quite fast now but won't be once the data is all there.

With a couple exceptions (see below), the parsing looks good. Here is the code as it is now.

CODE --> PHP

include("config/config.php");

$conn = mysql_connect($dbhost, $dbusername, $dbpass);

if (!$conn) :
    die("Connection failed: " . mysql_connect_error());
endif;
	
$db_selected = mysql_select_db($dbname, $conn);
if (!$db_selected) :
    die ('Can\'t use $dbname : ' . mysql_error());
endif;
	
$query = <<<SQL
SELECT  PartNo, Models
FROM    parts_listing
SQL;

$dropQuery = <<<SQL
DROP TABLE IF EXISTS parsedmodels
SQL;

$createQuery = <<<SQL
CREATE TABLE IF NOT EXISTS parsedmodels (
ID INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
partNo varchar(20) NOT NULL,
modelNo varchar(8) NOT NULL,
bodyNo varchar (8) NULL
)
SQL;

mysql_query($dropQuery) or die(mysql_error());
mysql_query($createQuery) or die(mysql_error());

$result = mysql_query($query);
while($row = mysql_fetch_array($result)):
    $partNum = $row[0];
    $chunks = explode(';',$row[1]);  //grab the model data
    $chunks = array_map('trim', $chunks); //trim it
    foreach($chunks as $key=>$val):
        if(strpos($val, '(') === FALSE):
            //no body data
            $prefix = substr($val, 0, 2);
            $bits = explode('-', $val);
            foreach($bits as $_key=>$bit):
                if($_key == 0):
                    addItem($partNum, $bit, NULL);
		    echo "inserting $partNum, $bit, NULL<br>\n";
                else:
                    addItem($partNum, $prefix . $bit, NULL);
		    echo "inserting $partNum, $prefix$bit, NULL<br>\n";
                endif;
             endforeach;
        else:
            //check for colon
            $bits = explode(':', $val);
            foreach($bits as $bit):
                if(strpos($bit,'(') === FALSE):
                    //no body data
                    $_bits = explode('-', $bi);
                    foreach($_bits as $_key=>$_bit):
                        if($_key == 0):
                            addItem($partNum, $bit, NULL);
			    echo "inserting $partNum, $bit, NULL<br>\n";
                        else:
                            addItem($partNum, $prefix . $bit, NULL);
			    echo "inserting $partNum, $prefix$bit, NULL<br>\n";
                        endif;
                    endforeach;
                else:
                    //has body data
                    $pattern = "|(\d.*?) \(BOD.*? (.*?)\)|";
                    preg_match_all($pattern, $bit, $matches);
                    foreach($matches[1] as $i=>$j):
                        $modelNum = trim($j);
                        $prefix = substr($matches[2][$i],0,2);
                        foreach(explode('-',$matches[2][$i]) as $p=>$bodyNum):
                            if($p === 0):
                                addItem($partNum, $modelNum, $bodyNum);
				echo "inserting $partNum, $modelNum, $bodyNum<br>\n";
                            else:
                                addItem($partNum, $prefix . $modelNum, $bodyNum);
				echo "inserting $partNum, $prefix$modelNum, $bodyNum<br>\n";
                            endif;
                        endforeach;
                    endforeach;
                endif;
            endforeach;
        endif;
    endforeach;
endwhile;
echo "Done";


function addItem($partNum, $modelNum, $bodyNum){
    $insertQuery = <<<SQL
INSERT INTO parsedmodels
(partNo, modelNo, bodyNo) 
VALUES
('%s,'%s','%s')
SQL;
    $query = sprintf($insertQuery, $partNum, $modelNum, $bodyNum);
    return mysql_query($query);
} 

For the exceptions, here are some examples. I already mentioned the RHD issue and thought I had mentioned the ALL issue too. For RHD, you can see that it is inappropriately shoving over the model value into the body column when instead it should prepend RHD to the model. The ALL issue is more difficult and will require that I create another table manually to deal with it, which will be easier once this code is working. That way I can get the values from this data, or possibly the code here can be modified to go it on the fly somehow but I'm not sure how.

So for ALL 24TH, the models table would have something like (this is a partial only):

Quote:

ID     Model
1      2401
2      2402
3      2406
4      2413 

. . . and it would need to give them each their own line

Quote:

inserting 426493, ALL 24TH, NULL
inserting G107762, ALL 24TH, NULL
inserting 426743, RHD, 2401, NULL
inserting 426738, RHD, 2401, NULL
inserting G120854, RHD, 2401, NULL
inserting G120380, RHD, 2401, 2501, NULL

There may also be instances where the bodyNo column will be populated but not the modelNo.

RE: Parsing Odd Data

You need the index for searching. The error was that indices cannot take an if not exists declaration. So of you delete that you should be fine.

Likewise you should not need to drop the table as the queries will just error for those rows that are already in the table. But there is little harm in doing so provides you are not using the ID of the row as a foreign key elsewhere.

The RHD issue is easy to solve but what are the rules? Does RHD 2401 mean that it is the 2401 model with a RHD drive train?

I can't derive the ALL issue without seeing some raw data understanding its meaning. If it means all model nums then that is again easy to solve.

There are references to 24TH in your posts. Is than an OCR artefact and the TH is being substitute for a number?

RE: Parsing Odd Data

(OP)
I don't see how not dropping the table would not allow duplicates since we're using insert without any key. However, the ID key will not be needed for anything so dropping and recreating the table is probably safer, particularly as the data is being developed.

Yes, the RHD would include all items in the field after the RHD so preceding the modelNo with RHD should be fine unless you can think of a better way.

For the All issue, it would be for all models beginning with the series number. That is, All 24th refers to all models that begin with 24 and I can create a lookup table as described in my last posting that lists them but I can't create it until this other table has been populated. Once it has been, I can simply grab out the distinct models into a separate table but it will have the structure as I indicated above. If it helps, I can add a column with 24TH, 25TH etc.

Just some history to try to make more sense of the designations, 24TH is in the data and refers to 24TH series. Not that it matters, but it means 1951 models, and was a code used by the manufacturer. In other words, ALL 24TH means all 1951 models. 22ND was 1948 to mid 1949; 23RD was mid 1949 to 1950; then it was by model year after that although the designation was changed in 1954 to 54TH series, which continued to 1957. I'm not sure how it was done for 1957 and 1958 so it's possible that some tweeking will be needed once I get that far but that's still some time off.

RE: Parsing Odd Data

(OP)
I see that there is a section to check for the odd colon but they no longer exist in the data. Can this be removed or repurposed to check for the comma to prepend RHD to the partNo?

RE: Parsing Odd Data

(OP)
In the script output, I just noticed that some with RHD are parsing oddly. This is the raw model data as it is in the database:

RHD, 2201-02-11-22-32-40; 2301

and RHD applies to all models listed, not just those before the semicolon. Here is the script output:

Quote:

378882, RHD, 2201, NULL
378882, RH02, NULL
378882, RH11, NULL
378882, RH22, NULL
378882, RH32, NULL
378882, RH40, NULL
378882, 2301, NULL

Until the RHD issue is corrected, this script output should be:

Quote:

378882, RHD, 2201, NULL
378882, RHD, 2202, NULL
378882, RHD, 2211, NULL
378882, RHD, 2222, NULL
378882, RHD, 2232, NULL
378882, RHD, 2240, NULL
378882, RHD, 2301, NULL

Note the RH as the first part of some model numbers and the lack of RHD on the last entry.

To clarify some entries, this from the script's output means that All applies to all 24th, 25th, 26th and 54th series models. First the raw data, then the script output:

ALL 24TH; 25TH; 26TH; 54TH

Quote:

418316, ALL 24TH, NULL
418316, 25TH, NULL
418316, 26TH, NULL
418316, 54TH, NULL

RE: Parsing Odd Data

(OP)
There is now a table called parts_modelno that contains a listing of the models for use with the All entries. It's simply a single column gleaned from a query of the distinct parsed data that the script created. Here is a sample from 22ND series and if it would be helpful I can add a column with 22ND etc.

Quote:

modelNo
2201
2202
2206
2211
2213
2220
2222
2226
2232
2233
2240

RE: Parsing Odd Data

Someone else may be able to assist this evening - regrettably I'm out of action until tomorrow. Apologies

RE: Parsing Odd Data

(OP)
No problem as I'm leaving town this afternoon for a couple days.

RE: Parsing Odd Data

(OP)
As mentioned, I added echos so that I could see what the script had done but I was surprised to see that it writes it all upon completion. Is there a way to make it echo line by line as it does the work?

On the model entries, this may be a bit tricky but I came across an instance where ALL does not apply to every one. It applies to those with an ordinal entry but not the others

ALL 22ND; 23RD; 2413; 2513; 2613-33

This means ALL 22ND and ALL 23RD but only the specific models for 24TH, 25TH and 26TH.

RE: Parsing Odd Data

I don't think that the parsing is difficult _technically_, it is more the searching that you will have trouble with now. Having said that, however, imposing structure on essentially free form text is seldom straightforward from a business rules context. Particularly so when your last post indicates that whomever wrote the original dataset did not even adhere to their own convention.

on searching, for example if someone says that they have a model 2401 your search will have to be

CODE

select * from parts
where modelNo REGEXP '24(01|[^\d])' 

RE: Parsing Odd Data

(OP)
Thank you. Right now I'm more interested in getting the data cleanly parsed out, then I can use it on the search and it has other uses too. It's close but not quite there.

RE: Parsing Odd Data

I was partially prevaricating as the nature of how you were going to search must lead the nature of how you parse out the data.

Given the lack of structure it is difficult to see a neat way of constructing a parser.

And I am concerned that your statement re the meaning of RHD as a qualifier to all models before and after semicolons completely breaks the rules of the rest of the data structure.

Will RHD always appear ONLY ever as the first three characters of a model string? Ditto ALL? If so perhaps some logical rigour can be applied.

RE: Parsing Odd Data

(OP)
My search engine has been online for some time so it not relevant to this discussion due to the way it works. Once this data has been parsed out I can use the new table to get the matches needed.

RHD it would apply to everything in the rest of the cell so each model number would be preceded with RHD. Due to the very nature of RHD vehicles, there are no parts I know of specifically for some RHD models that could possibly fit other non-RHD models. As far as I can tell so far, RHD is always at the beginning of the field.

ALL is also always at the beginning of the field. However, ALL does not necessarily apply to everything in the cell. It applies only to the ordinal numbers, like 22ND, 23RD etc. This seems to be working like the rest except that ALL is there only on the first one. However, these ALL entries need to be broken apart for individual models from the parts_modelno lookup table.

RE: Parsing Odd Data

search
believe me - it is entirely relevant. how you want to search must completely lead how you structure parsed data.

RHD
so it is impossible to have, say

CODE

2401;RHD 2402-03-04-05 
i.e. RHD must always be first in the field and apply to all models in the field.

[b]ALL[/b}
i think I'd need to see a comprehensive set of examples of each use of the term ALL to derive a plan to parse sensibly for that.

RE: Parsing Odd Data

(OP)
I didn't really want to get into it but the search engine does not directly search the parts data since it is set up to search all entries and tables on the site. It searches a table into which all the data has been dumped en masse. It might be relevant but it's unrelated to my question and to what I am trying to accomplish by cleanly parsing out the data. Until the data has been parsed out, the search engine is not what I'm working on.

No, it is not possible to have an RHD as you described. It is always at the beginning, is always separated from the rest by a comma and will always apply to all entries:

This will not happen:

CODE

2401;RHD 2402-03-04-05 

This is the way it always will be:

CODE

RHD, 2401-02-03-04-05 

There cannot possibly RHD parts that also fit LHD cars so it would apply to all the models listed.

Here are some examples of ALL:

CODE

ALL 54TH
ALL 22ND; ALL 23RD
ALL 24TH; 25th; 26th; 54th
ALL 22ND; 23RD; 24TH; 25TH; 26TH; 5402-06-13-26-31 

There are several like ALL 22ND; ALL 23RD but if it helps I can change them to ALL 22ND; 23RD

There may also be a combination:

CODE

RHD, ALL 22ND; 23RD 

RE: Parsing Odd Data

What is the meaning of RHD ALL 23RD?

Does it mean that all models starting with a 23 are RHD?

RE: Parsing Odd Data

(OP)

Quote:

What is the meaning of RHD ALL 23RD?

I didn't give an example like that but if you mean RHD, ALL 22ND; 23RD, it means ALL 22ND and ALL 23RD series RHD cars. To be honest, I don't know if it means that every 23RD series car had an RHD equivalent or if it means that RHD applies to ALL 23RD where they were RHD but for the sake of programming, we'll have to presume the former.

Quote:

Does it mean that all models starting with a 23 are RHD?

Well, not exactly but, yes, when related to the particular partNo it does. In other words, not all 23RD cars are RHD (or course) but for the particular part number, that is what it means. The part is for ALL 23RD series right-hand drive cars.

(I would be curious if anyone else watching this forum can guess what made of car we're talking about!)

RE: Parsing Odd Data

(OP)
Are there any more ideas to finalize this? It's almost there but not quite.

RE: Parsing Odd Data

jpadie requests you contact him privately to continue this, due to some unfortunate events.

----------------------------------
Phil AKA Vacunita
----------------------------------
Ignorance is not necessarily Bliss, case in point:
Unknown has caused an Unknown Error on Unknown and must be shutdown to prevent damage to Unknown.

Web & Tech

RE: Parsing Odd Data

(OP)
Yes, I know and saw his message here before someone deleted it.

RE: Parsing Odd Data

o.k, perfect.

----------------------------------
Phil AKA Vacunita
----------------------------------
Ignorance is not necessarily Bliss, case in point:
Unknown has caused an Unknown Error on Unknown and must be shutdown to prevent damage to Unknown.

Web & Tech

Red Flag This Post

Please let us know here why this post is inappropriate. Reasons such as off-topic, duplicates, flames, illegal, vulgar, or students posting their homework.

Red Flag Submitted

Thank you for helping keep Tek-Tips Forums free from inappropriate posts.
The Tek-Tips staff will check this out and take appropriate action.

Reply To This Thread

Posting in the Tek-Tips forums is a member-only feature.

Click Here to join Tek-Tips and talk with other members! Already a Member? Login

Close Box

Join Tek-Tips® Today!

Join your peers on the Internet's largest technical computer professional community.
It's easy to join and it's free.

Here's Why Members Love Tek-Tips Forums:

Register now while it's still free!

Already a member? Close this window and log in.

Join Us             Close