Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Chriss Miller on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Regex help for sql parsing

Status
Not open for further replies.

jpadie

Technical User
Joined
Nov 24, 2003
Messages
10,094
Location
FR
Hi

i'm looking for some pointers on pattern matching the following sql construct

Code:
select date_format({absolutely anything could go here, including commas}, '%Y') from table

i need to be able to retrieve
date_format(...,[format])
... and
[format]
as backreferences

the patterns i have been working with look like

Code:
(date_format *\((.*?),(.*?)\));

but these fall down where the ... contains, for example, a function inside. example;

Code:
select date_format(date_add('2007-01-01', interval '3' days), '%y-%m-%d') from table

any pointers would be very welcome.

thanks
Justin
 
Take a look at this thread I responded to back in May:


The tools demonstrated there, namely recursive regular expressions, will be able to do the job that you want just fine. I suggest that you start by creating a list of examples that you would like to match. And then slowly expand the functionality until all succeed. If you need any help getting it working right, feel free to ask. Just remember to post the code that you have thus far, including the examples.

- Miller
 
a simple approach:

Code:
my $foo = q{select date_format(date_add('2007-01-01', interval '3' days), '%y-%m-%d') from table};
$foo =~ /date_format\s*\((.*),\s*'.*'\)\s*from table/;
print $1;

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
thank you both for replying.
@MillerH
I think I see how your recursion works, although from my limited knowledge of perl it appears that the recursion is programmatic whereas I was assuming I would need some form of lookahead or lookaround within the pattern. Unfortunately I am not actually using perl as the programming language (PHP), I was simply asking the regex question here as other conversations with TT users led me to believe that you guys are the most advanced on regex topics.
Apologies for not being clear on this. as I said, I thought the solution could be pure regex (but on reading further I see that only the .Net library has a balancing engine which, I think, would have been needed to balance brackets etc).

@KevinADC
your method would not work for me as it stands because I cannot guarantee that the sql function is the last item in a select phrase. a typical query could, for example, look like
Code:
select date_format(date_add(dateField, INTERVAL '3' days), '%Y-%m-%d) as fDate, lastname, firstname from sometable order by dateField ASC LIMIT 3, 4;

I'm sorry that I did not make this clear.

Also, I don't think it would work if there were a comma in the first parameter of the function, e.g. the second example I posted above. I may have misunderstood though.

By the way, this exercise is part of a sql rewriting engine I am building to allow Wordpress to run on different database backends, aiming primarily at SQLite

But, as it happens, a good night's sleep and your posts led me to reanalyse the logical approach and I've realised that there is one constant in the pattern, and that is that the end of the date_format clause must always be " comma quote formatstring quote close-bracket " with varying degrees of interspersed white space. this pattern is unlikely (within the class of application I am looking at) to be repeated elsewhere in the sql clause so I have focused on that to provide a (currently) working solution.

for posterity (and peer review), the pattern I am using is as follows (for PHP preg_replace use). I have split the pattern out completely (and inefficiently) to aid interpretation and review. in practice, i would compress to a single line and avoid the x modifier.

Code:
$pattern  = 	'/date_format';	//match the function call
$pattern .=		'\s*';	//allow white space after the function call
$pattern .=		'\(';	//match the opening of the function bracket
$pattern .=		'(?P<param1>.*?)';	//match everything in the first parameter of the function
$pattern .=		',\s*';	//match the comma that starts the second parameter of the function with an arbitrary amount of whitespace after
$pattern .=		'(?P<formatString>';		//start a backreference
$pattern .=		'(';		//start a second backreference to match quotes
$pattern .=		'[\'"]';	//match the quotes that start the format string
$pattern .=		')';		//end the second backreference
$pattern .=		'\s*';		//more arbitrary whitespace (but there should not be any)
$pattern .= 	'%';		//match the format control character
$pattern .=		'.*?';		//match any characters following it
$pattern .=		'[^\\\\]\\3';		//match the same quote as started the format string (might need to allow for escaped backslashes)
$pattern .=		')';		//close the primary backreference (captures the whole format string
$pattern .=		'\s*\)';	//match an arbitrary amount of white space followed by a closing bracket
$pattern .=		'/imx';		//pattern modifiers to allow case insensitivity, multiline and arbitrary white space in the pattern.
 
Concerning, your comment about the programmatic nature of my regex example. Yes, it is. But all regexes are ultimately little subroutines. My example simply takes advantage of a lot of the more advanced features provided by perl 5.6+. Obviously, I wouldn't have suggested them if I had known you were using php though.

Adjusting your regex to perl read-ability, although not compile-ability:

Code:
    [red]m{[/red][purple][/purple]
[purple]        date_format [purple][b]\s[/b][/purple]* [purple][b]\([/b][/purple] [purple][b]\s[/b][/purple]*   # match the function call[/purple]
[purple]            (?P<param1>.*?)      # First parameter of the function[/purple]
[purple]            [purple][b]\s[/b][/purple]* , [purple][b]\s[/b][/purple]*            # Begin Second Parameter[/purple]
[purple]            (?P<formatString>    # start a backreference[/purple]
[purple]                ( [[purple][b]\'[/b][/purple]"] )        # Quote for start of formatString[/purple]
[purple]                    [purple][b]\s[/b][/purple]*          # Arbitrary whitespace (there should not be any)[/purple]
[purple]                    % .*?        # match the format control character[/purple]
[purple]                [^[purple][b]\\[/b][/purple]][purple][b]\3[/b][/purple]          # End of Quote (might need to allow for escaped backslashes)[/purple]
[purple]            )                    # End formatString[/purple]
[purple]        [purple][b]\s[/b][/purple]* [purple][b]\)[/b][/purple]                   # End Function[/purple]
[purple]    [/purple][red]}[/red][red]imx[/red]

Overall, it looks like your regex will work out alright for most cases.

However, there are almost always exceptions to any regex. The question is, are those exceptions acceptable?

It would of course help to know in more detail what your ultimate goal is. It does not sounds like you are aiming for a one time translation or parsing of these queries, but instead for continued usage of this functionality. This seems strange to me.

If it were just a one time use, I would advise you to just use unix grep of the most simplest form:

Code:
grep -ir 'date_format' *

Anyway, if you would like to see an attempt at partial balanced parsing using a single regex instead of nested:

Code:
[olive][b]while[/b][/olive] [red]([/red]<DATA>[red])[/red] [red]{[/red]
	[url=http://perldoc.perl.org/functions/chomp.html][black][b]chomp[/b][/black][/url][red];[/red]
	
	[url=http://perldoc.perl.org/functions/print.html][black][b]print[/b][/black][/url] [red]"[/red][purple][blue]$_[/blue][purple][b]\n[/b][/purple][/purple][red]"[/red][red];[/red]
	
	[olive][b]if[/b][/olive] [red]([/red][red]m{[/red][purple][/purple]
[purple]		date_format [purple][b]\s[/b][/purple]* [purple][b]\([/b][/purple] [purple][b]\s[/b][/purple]*[/purple]
[purple]			(?:					# First Parameter[/purple]
[purple]				(?> [^[purple][b]\,[/b][/purple][purple][b]\)[/b][/purple][purple][b]\'[/b][/purple][purple][b]\"[/b][/purple][purple][b]\([/b][/purple]]* )				# Any Character, Except...[/purple]
[purple]				|[/purple]
[purple]				(?: [purple][b]\'[/b][/purple] (?: (?>[^[purple][b]\'[/b][/purple][purple][b]\\[/b][/purple]]+) | [purple][b]\\[/b][/purple] . )* [purple][b]\'[/b][/purple] )	# Single Quote (Escaped)[/purple]
[purple]				|[/purple]
[purple]				(?: [purple][b]\"[/b][/purple] (?: (?>[^[purple][b]\"[/b][/purple][purple][b]\\[/b][/purple]]+) | [purple][b]\\[/b][/purple] . )* [purple][b]\"[/b][/purple] )	# Double Quote (Escaped)[/purple]
[purple]				|[/purple]
[purple]				[purple][b]\([/b][/purple] .*? [purple][b]\)[/b][/purple]						# Parenthesis (not balanced, nor string escaped)[/purple]
[purple]			)*					# End First Parameter[/purple]
[purple]		(?> [purple][b]\s[/b][/purple]* , [purple][b]\s[/b][/purple]* ) # Parameter Separation (no backtracking)[/purple]
[purple]			(					# Second Parameter (Quoted String)[/purple]
[purple]				[purple][b]\'[/b][/purple] (?: (?>[^[purple][b]\'[/b][/purple][purple][b]\\[/b][/purple]]+) | [purple][b]\\[/b][/purple] . )* [purple][b]\'[/b][/purple]	# Single Quote (Escaped)[/purple]
[purple]				|[/purple]
[purple]				[purple][b]\"[/b][/purple] (?: (?>[^[purple][b]\"[/b][/purple][purple][b]\\[/b][/purple]]+) | [purple][b]\\[/b][/purple] . )* [purple][b]\"[/b][/purple]	# Double Quote (Escaped)[/purple]
[purple]			)					# End Second Parameter[/purple]
[purple]		[purple][b]\s[/b][/purple]* [purple][b]\)[/b][/purple][/purple]
[purple]	[/purple][red]}[/red][red]ix[/red][red])[/red] [red]{[/red]
		[black][b]print[/b][/black] [red]"[/red][purple][purple][b]\t[/b][/purple][blue]$.[/blue] - [blue]$1[/blue][purple][b]\n[/b][/purple][purple][b]\n[/b][/purple][/purple][red]"[/red][red];[/red]
	[red]}[/red]
[red]}[/red]

[teal]__DATA__[/teal]
[teal]select date_format(dateField, '%Y') from table[/teal]
[teal]select DATE_FORMAT(NOW(), '%Y') from table[/teal]
[teal]select date_format('never, but maybe , "%MOOOO")', "%Y") from table[/teal]
[teal]select date_format(date_add('2007-01-01', interval '3' days), '%y-%m-%d') from table[/teal]
[teal]select date_format(date_add(dateField, INTERVAL '3' days), '%Y-%m-%d') as fDate, lastname, firstname from sometable order by dateField ASC LIMIT 3, 4;[/teal]

As, you can see, gah is that long! However, it covers all of those three cases literally. It follows the rules of functions instead of any specialized knowledge about SQL or what your format string will be. With compiled expressions I linked you to, that can be a lot simpler.

Ultimately though, I'm hoping that you'll find the grep code is what you want. Real time parsing of this this nature is a mess, and quite tricky to get right.

Good luck,
- Miller
 
Thanks Miller,

i'll digest your balancing code after some coffee. Somehow reading regex on the page is far more difficult than writing them yourself!!

you are right in that i am looking for a real time parsing solution: continued usage. I accept that I am going to have to be happy with a 90% solution.


The ultimate goal is to port Wordpress to use SQLite and Mysql out of the box (and other dbs) by enhancing the already existing abstraction layer. Others have kind of succeeded in the port by manually changing all of the sql in the application: but this is a second-rate solution as the fork must be maintained against every release of Wordpress (and, as some measure of proof, the most popular fork is still buggy as hell and has not been maintained since 2006)

The problem is that the author(s) of Wordpress have used a lot of the non-SQL functions of mysql to make their coding lives easier: nothing inherently wrong with this within their design constraints. there are also a variety of different styles employed (less good), so I cannot be confident that a specific pattern will be matched for all cases.

so what I am trying to do is build a set of regex that will parse those queries that I know to be incompatible and that I know are currently used in WordPress. these two criteria give me a limited set of problems - all to the good. At the same time I am moving the variable handling and enquoting of SQL queries to a sub-routine so that Wordpress will become significantly more secure against sql injection attacks (and use prepared statements and placeholders).

So as not to have to maintain an ongoing diff against the core Wordpress code, I am building the abstraction layer as a quasi-plugin in a single file. A design constraint is that all changes to the application behaviour must be stored in this single file. If I build it well enough I should be lucky at least some of the time that updates to the wordpress app won't bust the abstraction layer!

outside of plugins - the good news is that there is definite number of sql calls within the application and after some bedding in I will be confident that I've caught them all and thus can publish the abstraction layer for others to test. The killer at the moment is that Wordpress has a complex diffing algorithm for handling database schema changes on upgrades. I can't see a way to fix this neatly so am just bypassing it for the time being. Re-injecting the plugins - my hope is that most of these simply perform select queries on the dataset and so will play nicely with the abstraction layer. If they don't then I guess people will have to fix the regex themselves or tell me about the problem so that I can fix it. it will be an iterative (but hopefully not recursive...) process.

thanks for your time again.
Justin
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top