INTELLIGENT WORK FORUMS
FOR COMPUTER PROFESSIONALS

Log In

Come Join Us!

Are you a
Computer / IT professional?
Join Tek-Tips Forums!
  • Talk With Other Members
  • Be Notified Of Responses
    To Your Posts
  • Keyword Search
  • One-Click Access To Your
    Favorite Forums
  • Automated Signatures
    On Your Posts
  • Best Of All, It's Free!

*Tek-Tips's functionality depends on members receiving e-mail. By joining you are opting in to receive e-mail.

Posting Guidelines

Promoting, selling, recruiting, coursework and thesis posting is forbidden.

Jobs

SQL statement for approximate string matching

SQL statement for approximate string matching

(OP)
Hello,

I'm trying to compose an SQL-query which does approximate string matching using n-grams.
This is the problem that I'm trying to solve: we have a system on which you can register your company and users within that company. However, before a new company is registered, we would like to do a check to see whether that company already exists in the database. In this case there is a good possibility that the user didn't type the company name exactly as in the existing record in the database.

The idea that I'm trying to implement is this: from the input string (the new company name) we generate a number of n-grams. Let us let n be 3. Then the 3-grams of the string "approximate" for example are "app", "ppr", "pro", "rox", "oxi", "xim", "ima", "mat", and "ate". I would like to do a select-statement which returns (or counts) all the company names in the COMPANY table that contain a certain number of these 3-grams.
More generally, we are looking for a SQL-statement that selects all rows that, given n conditions, satisfy at least i of these n conditions.
How would we do this in a SQL select-expression?

   ---------------------------------------------------

The idea that I got is to use a SQL-query of the following form:

CODE

SELECT ... FROM ... WHERE (CASE WHEN #CONDITION1# THEN 1 ELSE 0) + (CASE WHEN #CONDITION2# THEN 1 ELSE 0) + ... + (CASE WHEN #CONDITIONn# THEN 1 ELSE 0)>=i
.
For this specific scenario the SQL-query would look something like

CODE

SELECT COMPANYFULLNAME FROM COMPANY WHERE (CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%mic%' THEN 1 ELSE 0) + (CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%icr%' THEN 1 ELSE 0) + (CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%cro%' THEN 1 ELSE 0)>=3
.
The problem is that DB2 finds syntax errors in this SQL. It says

CODE

SQL0104N  An unexpected token ")" was found following "%mic%' THEN 1 ELSE 0".  Expected tokens may include:  "END
It seems as though DB2 doesn't like the parentheses around the CASEs. However, when I remove these parentheses, I get a different syntax error:

CODE

SQL0104N  An unexpected token ">=" was found following "%cro%' THEN 1 ELSE 0".  Expected tokens may include:  "END
Is there a way to write this SQL-statement so that DB2 accepts it?
Or should I take a different approach on this.

I'll appreciate any help.
Thanks in advance.
 

RE: SQL statement for approximate string matching



hi,

The WHERE clause is looking for an expression, not a value.

For instance in the WHERE, you might use a CASE decode to return a field name or a field value, but eventually you need...

CODE

WHERE SomeField SomeOperator SomeValue
 

Skip,

glassesJust traded in my old subtlety...
for a NUANCE!tongue

RE: SQL statement for approximate string matching

(OP)
I believe the form of an expression in a WHERE clause is a bit more general, such as

CODE

WHERE #expression1# #operator# #expression2#

But man, I just figured out what I did wrong in that SQL-query, and indeed, it was just a simple syntax error.
The syntax of the CASE construct is of the form "CASE WHEN #condition# THEN #expression1# ELSE #expression2# END". I clean forgot the "END" at the end. When the syntax error told me that an END token was expected, I thought that it meant the end of the query.
The correct form of the type of SQL-query that I'm implementing is

CODE

SELECT ... FROM ... WHERE CASE WHEN #CONDITION1# THEN 1 ELSE 0 END + CASE WHEN #CONDITION2# THEN 1 ELSE 0 END + ... + CASE WHEN #CONDITIONn# THEN 1 ELSE 0 END>=i
And the example SQL-query instantiation is

CODE

SELECT COMPANYFULLNAME FROM COMPANY WHERE CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%mic%' THEN 1 ELSE 0 END + CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%icr%' THEN 1 ELSE 0 END + CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%cro%' THEN 1 ELSE 0 END>=2
And yes, this query works perfectly!
Thanks for the help!
 

RE: SQL statement for approximate string matching

We can simplyfy the same query as below:
SELECT
 (CASE WHEN LOWER(COMPANYFULLNAME) LIKE '%mic%' THEN 1
       WHEN LOWER(COMPANYFULLNAME) LIKE '%icr%' THEN 1     WHEN LOWER(COMPANYFULLNAME) LIKE '%cro%' THEN 1 ELSE 0 END)
FROM COMPANY;

RE: SQL statement for approximate string matching




Yes you were correct.  Now I see what you are attempting.

Skip,

glassesJust traded in my old subtlety...
for a NUANCE!tongue

RE: SQL statement for approximate string matching

bghh,
Can you use the SOUNDEX function?

Marc

RE: SQL statement for approximate string matching

(OP)
Well Marc,
the SOUNDEX function (or DIFFERENCE function, which is based on SOUNDEX) works quite well in some cases. One specific case where it doesn't work well is when the two strings that are compared are similar, but they don't start with the same letters; for instance "Apple Co" and "The Apple Co" is a poor match according to DIFFERENCE.
However a pair of strings like "Microblah Pty Ltd" and "Mikrobla" is a good match according to DIFFERENCE.
The solution above with the CASEs works well, but if performance becomes an issue, I'll consider using DIFFERENCE.
 

Red Flag This Post

Please let us know here why this post is inappropriate. Reasons such as off-topic, duplicates, flames, illegal, vulgar, or students posting their homework.

Red Flag Submitted

Thank you for helping keep Tek-Tips Forums free from inappropriate posts.
The Tek-Tips staff will check this out and take appropriate action.

Reply To This Thread

Posting in the Tek-Tips forums is a member-only feature.

Click Here to join Tek-Tips and talk with other members!

Resources

Close Box

Join Tek-Tips® Today!

Join your peers on the Internet's largest technical computer professional community.
It's easy to join and it's free.

Here's Why Members Love Tek-Tips Forums:

Register now while it's still free!

Already a member? Close this window and log in.

Join Us             Close