×
INTELLIGENT WORK FORUMS
FOR COMPUTER PROFESSIONALS

Log In

Come Join Us!

Are you a
Computer / IT professional?
Join Tek-Tips Forums!
  • Talk With Other Members
  • Be Notified Of Responses
    To Your Posts
  • Keyword Search
  • One-Click Access To Your
    Favorite Forums
  • Automated Signatures
    On Your Posts
  • Best Of All, It's Free!

*Tek-Tips's functionality depends on members receiving e-mail. By joining you are opting in to receive e-mail.

Posting Guidelines

Promoting, selling, recruiting, coursework and thesis posting is forbidden.

Students Click Here

a question regarding data integrity
7

a question regarding data integrity

a question regarding data integrity

(OP)
I'm sure what I'm about to pose here is far from a-typical.  What I am looking for is how this situation is commonly addressed in a professional db environment.

The issue:

You have a table with CompanyID(PK) and CompanyName.
One user enters ABC Co Inc. for the company name.  A couple of days later, another user enters in Associated Business Contractors, and on another day, a different user enters Smith Enterprises, dba Assoc. Biz Inc.

All of these entries refer to the same company, but since there are thousands of Company records, no one users seems to notice that it has been already entered under a different name.  Consequenty, database integrity gets marred because then different users update related records for different ABC companies.

What is a good strategy to deal with this?

RE: a question regarding data integrity

Restrict that type of input to a position or department which takes data integrity seriously.  

If that's not possible, hopefully your applications provide a bit of intelligence to help the user select an existing record rather than creating a new one.

If that's not the case, perform routine data cleansing operations.

RE: a question regarding data integrity

(OP)
i did write a routine that does a basic check, but it's really not enough.

Quote:

Restrict that type of input to a position or department which takes data integrity seriously.

yeah, i wish!

Thing is that either we give these people some authority, and they mess up the integrity, or we restrict it, but then I or my ops manager ends up wasting time doing simple entry every day.

By data cleansing, I suppose you mean something like doing a data dump on the table every so often and looking for duplicates and update the database accordingly, right?  
 

RE: a question regarding data integrity

For data cleansing, you could write a routine or create a report which someone looks at to find duplicates, and do updates on your database from that.  

I've also seen systems (and written similar logic myself) which provide "merge" capabilities so that an aware user can pick two entities and merge them into one, including all of the transactional data as well.  There are several ways to do that, but the best way includes leaving a historical trail so that you are aware of what has been merged and when.  These may require multiple database operations, so it would work best in an ACID compliant DBMS.    

RE: a question regarding data integrity

(OP)
I had thought of designing screens to allo people to the merge idea, leaving the fix to the the "aware" user, with the historical trial as you describe.   Of course that could probably get a little complex, but i can see how that would be doable.

So you sound like you're basically confirming my hunch that for this kind of thing, there is really no way to automate fully - it's more of a quesiton of management of the amount human intervention required - is that about right?

RE: a question regarding data integrity

Yes that's exactly right.  

However, one thing I forgot to mention is that there is software and services out there for purchase and fees that may be able to help you find duplicates, but the cost may outweigh the benefits.  The only time I've used such a service was for address cleansing, and if your records contain the addresses of the businesses, then it may be worth a look.  Whether or not there is anything out there that would be able to map ABC to Associated Business Contractors I have no idea.  Someone else may have more insight into the algorithms these systems use, but regardless, I would be hesitant to automate updates based on these types of software.

RE: a question regarding data integrity

Would it be possible to include a strictly regulated abbreviation field that would take the first letter of each element of the company name? It might then be possible to show the user other entries with the same abbreviation. Soundex might also be helpful.

RE: a question regarding data integrity

Do your users enter addresses for the new companies? If so you might do a check on the address with existing companies and ask them if they meant (ABC company) when they type American Business Company. Then have a process that either purges the new record or dedups them depending on what other work happens.

You could also set up a report for management of all companies that are located in the same place (or have the same name)and ask them to identify and get rid of duplicates.

You should set up a formal dedup process. In ours we select the profile we want to keep and it adds any records that don't match in tables that are a one to many relationship (all the orders for both profiles would be kept for instance, but the id would be changed to the one you were keeping) and in other tables asks the user to choose the correct information by comparing them side by side (Such as is it ABC or American Business Company). Depending on the data associations this may be something that has to be done one at time with manual decision making or a group of records can be identified and the dedupping happen automatically based on a defined set of business rules.  

"NOTHING is more important in a database than integrity." ESquared
 

RE: a question regarding data integrity

(OP)
great discussion.  I'm grateful for your interest.  Soundex is interesting.

we do track addresses, but they are not required, but even if we did, typos and other mistakes could still derail this, don't you think?   However on the other hand, checks like this could at least close the gap and make it incrementally harder to enter potential dups.

Also seems like writing a little mini-app that does all this stuff could be quite time consuming, which I guess leads me to the this rhetorical question that only I and the firm can answer: - how painful of a problem is this to management?  Where does working on an app(let) that makes data entry considerably more air-tight fall with the myriad of other priorities they have for their data management system?

RE: a question regarding data integrity

Quote:

Do your users enter addresses for the new companies? If so you might do a check on the address with existing companies and ask them if they meant (ABC company) when they type American Business Company. Then have a process that either purges the new record or dedups them depending on what other work happens.
Taking this even a step further, you could enter the duplicate, rejected company name into a stop-list to prevent people from repetitively entering duplicate records.  The stop-list should also link to the appropriate company record.

RE: a question regarding data integrity

3
I feel a bit queasy here, maybe I'm the one who's barking mad, but isn't this all twaddle?

<rant on>
Allowing duplicates and then cleaning them, whatever next?

I suspect that this is a fundamental cock-up in business processes.  Each company can exist only once (that's a legal obligation in the U.K., and I suspect also in the U.S.) so why are you even contemplating duplicates?

Users must only be allowed to select from a list of already-known companies, or be allowed to request the addition of a new potential business partner.  Each company must have a unique identifier (again, it's a legal obligation) so why don't you get the users to do enough work to know with whom they are dealing?  Surely striking a multi-billion dollar deal with the Acme Widget company is a disaster, when actually you dealt with the Acme Wodget company.

If users don't know who they're dealing with, why oh why are they being allowed to add reference data to a database?

Are your punters so dappy that they can't be bothered to contact Companies House (or the U.S. equivalent thereof) and download a definitive list of all registered companies.  After that, users may only select from legitimate companies, not post code (zip code) plonkers!

<rant off>

<hastily dons asbestos-lined trousers>

 

Regards

T

RE: a question regarding data integrity

(OP)
well you do have a point and it is well taken despite the rant.  but let me ask you a simple question.  Do you want to be the guy that all the account people call every day to add new accounts to the system? Is your time best spent doing rote data entry - or should there be some way to automate that, which is the basis of the attempt at automation.   

actually, i think the single point of entry is the way we're going to deal with it.  Someone (not me) is going to just have to suck it up and do it.
 

RE: a question regarding data integrity

unfortunately, somebody always has to...if not this, then something else...

Leslie

In an open world there's no need for windows and gates
www.ubuntu.com

RE: a question regarding data integrity

Sadly, I fear you're correct.

However, I do think there might be some mileage in downloading a master list from Companies House.  Users could then select, and where the "company doesn't exist" manual intervention would be necessary.  At least that way you would only be fettling the system when essential, not every time someone makes a new business contact.

Also, by assigning a unique company identifier within your system, you immediately simplify the merging problem later on, as you can just transfer the "meat" of the contacts, and ignore the already-known demographic stuff, such as address.

Regards

T

RE: a question regarding data integrity

Star for Tharg - good to see a commonsense approach. I have always found it worrying when a company allows IT guys to make decisions that fall outside their areas of knowledge and make work for others which IT should have been preventing.

___________________________________________________________
If you want the best response to a question, please check out FAQ222-2244: How to get the best answers first.
'If we're supposed to work in Hex, why have we only got A fingers?'
Drive a Steam Roller
Steam Engine Prints

RE: a question regarding data integrity

I would think the account people could enter the data as they know it and it be immediately available until verification and validation is performed and the record is finalized?

< M!ke >
Where are we going and why am I in this handbasket?

RE: a question regarding data integrity

(OP)

Quote:


company allows IT guys to make decisions that fall outside their areas of knowledge

this was not an IT decision, and it is hardly something that falls outside of our knowledge.  It was a mangement decision, which IT implemented.  I will grant you that this was an IT oversight, and IT should have recognized that this was not a good idea an advised management accordingly, since it is not management's responsibility to be able to forsee such issues.

 

RE: a question regarding data integrity

Quote:

Users must only be allowed to select from a list of already-known companies, or be allowed to request the addition of a new potential business partner.
This is so naive that I would subtract stars if I could.

First of all, your "list of already-known companies" is always outdated.  Always.

You (and your applauders) have obviously never dealt with sales or marketing.  Do you think you sell to IBM, period?  That's the company, isn't it?

The truth is that this is much more complicated than you represent.  Very often a "customer" is an independently operating part of a larger business:  they order things on their own and they pay invoices on their own.  If those invoices were directed to "Official Co. HQ, Inc." HQ wouldn't know what to do with them.

This type of data deconfliction takes human intelligence to be able to tell whether the business unit "Official Co., Western Region" is the same as "Official Co., Denver, CO."  What's your guess?

Getting back to sales, you must enable your staff to take orders, period, without the need to guess at a selection of an existing business or (horrors!) wait for approval to enter a business.  

M!ke above offers a reasonable method of dealing with "sloppy" data entry.

RE: a question regarding data integrity

(OP)
wow i like the debate here.  

the message I'm taking away from all this is there seems to be no simple or de-facto way to get around the human intervtion aspect, but if you try, you're probably into a lot of creative engineering which may or may not be worth the trouble, when compared with just having a single point of entry.  

harebrain hit it on the head, I support all marketing and account management types, and information changes frequently and rapidly, causing a lot of outdated information which is hard to stay on top of no matter how you slice it.



 

RE: a question regarding data integrity

Also I know of no US equivalent list that lists all of the millions of companies in this country. Further to put all of them on a drop-down list would be impractical.

"NOTHING is more important in a database than integrity." ESquared
 

RE: a question regarding data integrity

Yeah, at best you would have to aggregate all 50 states lists of tax ID (which I'm not sure are public info anyway.) Not to mention, I believe it's legal in some circumstances to be a "business" without a tax ID.  Then you have 'ABC, Inc.' doing business as 'XYZ Assoc.', 'ABC, Inc' dba 'JKL, Services', etc.  All different entities.

As harebrain mentioned, you have different offices of 1 company that may be separate billing entities.

And just for more fun as an example, go to Google and search for "Northstar Consulting".  Many different companies with the same name.

There's no real way to automate this.

_____
Jeff
It's never too early to begin preparing for International Talk Like a Pirate Day
"The software I buy sucks,  The software I write sucks.  It's time to give up and have a beer..." - Me

RE: a question regarding data integrity

Folks,

my rant was about the fundamental idiocy of deliberately allowing dross into a db, only to have the anti-dross workers come around later to clean it up - the thread's title mentions that dreaded word "Integrity" and sql sister's signature implies that she too thinks it's kind of important every now and then.

So, to take issue with the esteemed protagonists involved in this 'ere debate, I proceed in order of response:-

Harebrain, you are quite correct, in that so many companies have different offices around the country, etc.  This only goes to demonstrate the well known phenomenon of "hidden complexity".  As your post elegantly points out, there's more to it than meets the eye.  My response is "so bleeding what?".  Dealing with hidden complexity in business processes is why software developers get paid.  If it was easy, we'd all be doing it.  If the system designers can't handle the complexity, get yourself some different designers.

Boiling down all the complex business interactions involved, and producing a clear cut requirements specification would be no trivial exercise.  Nonetheless, if the business wants excellence, it will just have to cough up for it.  A master list might not be feasible in the end, but a free-for-all is far worse.  If human intervention is unavoidable, then it should be carefully controlled and audited.

As regards customer recognition, isn't it bad to ring up a customer and discuss something, only to have them respond "but I talked to your colleague xyz only two days ago, don't you guys talk to each other?".  Resolving differences is (IMHO) really important to have joined-up interaction with customers.

Second, the illustrious SQL Sister.  I must say that I am amazed that companies can trade with no legal obligation to have an address of incorporation.  So in the U.S. of A anyone can set up a company, start trading, open business bank accounts, not bother to pay taxes etc., all with no governance - amazing.

Second, the notion of a drop down list with a million vaues is risible.  Any developer worth their salt would craft a form offering appropriate filtering, selection criteria, recent favourites etc.  I don't believe that's a genuine issue.

As for the issue of billing addresses etc., any general purchase ledger software that can't cope with a business incorportated as address A, having goods delivered to its premises at addressed b,c,d, and with a billing address at e is dross, and should be replaced immediately.  This is old hat for GPL software, so why should it be any different for other suites attempting (and apparently failing) to offer similar functions.

Regards

T

RE: a question regarding data integrity

Tharg, in the US a "corporation" is simply one of many ways to setup a business.  Simple companies can be setup where there is no separation between your personal assets and company assets. You can have many different legal entities using the same business address or one corporation using many billing addresses.  You can have one entity "doing business as" another entity with legal separation between them.  In general, it's a free for all.

To open a business bank account you would need a tax ID for the business, but that may be federal or may be state issued. There is no master list.

_____
Jeff
It's never too early to begin preparing for International Talk Like a Pirate Day
"The software I buy sucks,  The software I write sucks.  It's time to give up and have a beer..." - Me

RE: a question regarding data integrity

That explains some of the problems.  Here, every limited company (even those run by one man and his dog) are obliged to have on public display a certificate of incorporation at the company's registered address.  Each company has a unique number, assingned by Comopanies House, who liaise with H.M. Customs and Excise to make sure that all duties and taxes are paid.

You do seem to have a bit of a "snowflake in a blizzard" problem your side of the pond.  I still believe that there is no excuse for poor software though.  If business in the U.S. is like that, then shouldn't the major software houses at least make a fist at dealing with the situation?  I must say that I'd want Messrs. Clinton and Obama to do something about such a situation though - it seems ripe for fraudsters etc.

Maybe we could set up a Nigerian Trading company based in the U.S., and see how many punters we can fleece before the old bill come calling?

Sorry, I just realised that I've wandered off thread.

Regards

T

RE: a question regarding data integrity

 

In the U.S. all you need to open a business bank account is a Federal Tax ID for the business.  If the business has no employees, then a Federal ID is not required and one can use their personal Social Security Number instead.

Businesses here can be Corportations, Limited Liability Companies (LLC), Partnerships, Limited Parterships, Sole Proprietors, etc all with different Federal, State, County, and local requirements for registration or non-registration as the case may be.

Nor does is a business required to have a distinct business name.  Anyone can use their personal name for the business name so long as no one else is using the same name in the counties that it is registered in.  In reality, there can be over 3000 totally unrelated businesses with exactly the same name, one in each county.  However, in practice, most businesses register in all or most counties in one or more states.  If it is a Corporation or some other state regulated business, then registering with the state automatically registers the business in all of the counties in the state.

As for how to handle the data integrity problem, I use Zip Codes as the primary entry with a drop-down list of available customers in that zip.

If perchance the zip is unknown, the second choice is using phone numbers for lookup, HOWEVER, new customers are not allowed to be entered just because a phone number cannot be found since most companies have several or even thousands of different phone numbers.

Finally, lookups are allowed using company names.  This uses a modified Soundex where names like 'Eden Truffles,' Dental Supplies,' and 'Teen Eating House' would all come up on the same list.  This does not work very well when the same company name (AAMCO for example) shows up hundreds of times along with unrelated variations (AMCO, AMOCO, etc).

After all of these have been checked, then and only then, can a new customer be added.  This does not totally solve the problem, but it eliminates most duplicates up front.

Since I generally only need to worry about a maximum of 20,000 business entities this seems to work well for me.

 

mmerlinn

http://mmerlinn.com

"Political correctness is the BADGE of a COWARD!"

 

RE: a question regarding data integrity

Quote:

my rant was about the fundamental idiocy of deliberately allowing dross into a db
Picture yourself and the owner of a business, standing behind a salesman taking an order over the phone and entering it into the computer.  Now ask the owner, "What do you want, this sale or data integrity?"  The business owner knows what "sale" means.

My point was that operational reality often clashes with CS dogma.  Welcome to the real world, where we must reconcile ugly realities with pretty models.

RE: a question regarding data integrity

[quote="harebrain"] Picture yourself and the owner of a business, standing behind a salesman taking an order over the phone and entering it into the computer.  Now ask the owner, "What do you want, this sale or data integrity?"  The business owner knows what "sale" means.

My point was that operational reality often clashes with CS dogma.  Welcome to the real world, where we must reconcile ugly realities with pretty models. [/quote]

Very good point. So taking away the possibility of a user to add a company would be no good idea. Nevertheless having two records for the same company messes up all statistics on that business contact etc.

First it seems a little too less information stored into the companies record, only a PK and the company name.

If companies are customers, why not use the wide spread practice of adding your own identiefier, eg a customer number you may ask of recurring customers to reidentify them exactly or some other data which suits as an identifier as telefone number or mail adress, as they are by their nature existing only once. Or add your own 3-5 letter acronym to the company name, which would less likely differ each time someone enters it.

I understand that a company can have many phone numbers and mail adresses, but that may reflect the fact you have to do with different departments with their independant accounting office.

The idea for an algorithm to make suggestions for what company the name entry could be using Soundex is a start. But Soundex does not help with eg abbreviations. Having a full text index you could search for parts of the company names entered combined with an acronym search would perhaps help more.

Bye, Olaf.

RE: a question regarding data integrity

thargtheslayer,

So, what about all the sole traders who don't need to register with Companies House?

NuJoizey,

If you wish to go down the controlled route, Dun and Bradstreet are the leaders in solutions for this problem.  http://www.dnb.com

Craig

RE: a question regarding data integrity

They are required to declare themselves as sole traders.  Again, that's mandatory, and has subtly different legal connotations.  I believe that a sole trader may have his personal assets seized against any debts, whereas a limited company (as its name suggest) has liability limited to its assets, and not the personal assets of its directors.

Regards

T

RE: a question regarding data integrity

No, sole traders do not need to register with Companies House.  They are unincorporated and therefore not a company by definition.

And then, what would you do with either an individual who wanted to trade with you or an international company?  Would you turn them away because 'you don't have a company number'?  I'd love to see you explain to the boss that the reason the business has to turn down work is because 'Computer says no'.

C

RE: a question regarding data integrity

To add to the mess of using a TaxID, if the business entity is an individual and not a Inc, then you would be storing that Individual's SSN. Now you get into SOX, and privacy issues of insuring that the SSN and all other "privacy Information" is encrypted and secured and all employees that look at that account with the SSN have signed Confidentiality notices.

So to use the SSN/FEID as the unique look up value, just snowballed into data security. Phone numbers and addresses tend to be a good starting place for comparing against existing parties. Sure you're going to have the occasional typo, but it does weed out many of the potential dupes you will encounter. It's a fairly simple check to see if phone/address all ready exists, especially if you control the exact format that the phone number is entered in.

Not perfect, but the cheapest and easiest way to help. A weekly/monthly report of all new clients could also cut into the dupe count, greatly. And with a small company that cannot see the ROI on buying "look up" data, it's about all you have.

"If I were to wake up with my head sewn to the carpet, I wouldn't be more surprised than I am right now."

RE: a question regarding data integrity

Hmmmmm...

In real-life databases, duplication happens. There's stuff you can do to make it less likely, but it's bound to happen in a database of any size. A few random thoughts:

One approach I've used to reduce duplicate records is to only allow users to enter a new record after they've searched existing ones. For example:
  1. The user populates the customer details by clicking a [Customer...] button.
  2. The button pops up a dialog asking for name, address and other search parameters. The user enters what (s)he knows and clicks [Search].
  3. A list of matching entries is shown. The user may pick one of these, or click a [New...] button
  4. Only at this point may they add a new customer
This approach works best when there's a good chance that the record is already in the database (I've used it for address entry where we've got a table of postal addresses). If a large percentage of customers are new ones, this approach may not be for you.

Since companies can and do operate under multiple names, your database should/could be designed to cope with this. An extra table can be used to hold aliases for the companies you deal with, and factored in to any search/duplicate spotting routines you have.

Nonetheless, duplicates will still occur from time to time, so I suggest you should have a process in place to merge them together. Reports that identify potential duplicates are a good idea too.

Having said that, you also need to determine how serious a problem a few duplicate records is to your business. I'm working in a Social Services department at the moment, and it's really important to us to know that the John Smith whose father is occasionally violent is the same John Smith that's turning up in the hospital twice a week. Does it really matter that you think you've got 2053 customers, but you've actually got 2050 and 3 duplicates? That's a cost-benefit question for the business to answer.

-- Chris Hunt
Webmaster & Tragedian
Extra Connections Ltd

RE: a question regarding data integrity

I too work with a database which must allow duplicates.  It's a medical application for hospitals.  For reasons of either vanity or embarassment, some folk lie about who they are when being treated.  They say that this is the "first time I've been treated."

Despite the fact that their medical history may contain information which is invaluable to the clinician attempting diagnosis and treatment, this happens often.  Some repeat offenders are recognised by staff who quietly enter the correct demographic data.  The NHS Number is used as the unique identifier in this case, and we have to merge duplicate records together sometimes, when tests reveal that two supposedly different people are one and the same.  Duplicate elimination is a major hassle and fraught with risk.  What if you wrongly merged Joe Smith with Joseph Smith, only to find out later that they are different, and one is allergic to penicillin and the other isn't?

There's nowt so queer as folk!

Regards

T

RE: a question regarding data integrity

Therg,

That sort of problem was dealt with years ago by Martin Fowler et al by the Cosmos project for the NHS.  In his book, Analysis Patterns, he describes merge/unmerge strategies pretty clearly and how to deal with the risks.  It was canonical for me when working on NPfIT.

C

RE: a question regarding data integrity

Quote (tharg):

some folk lie about who they are when being treated.
I've worked around clinical settings (ED) too.  Some folks are confused.  Some folks are unconscious.  Some folks are otherwise incapacitated or uncommunicative.  How about not tarring them all as "liars?"

RE: a question regarding data integrity

No tarring implied hare, just stating facts.  The software does allow for unknown persons to be treated.  For example, the A&E department might get a call from an ambulance crew saying "We've got three casualties from the pile-up, we're on the way now, ETA ten minutes, we'll need 5 units of blood on arrival."  

Our system duly allows A&E staff to book out units of blood for an unknown person of unknown age and/or gender, as it's recognised that the patient may be unknown (because right now they're in the back of an ambulance doing 70 m.p.h.) or unable to speak for him or herself.  The patient may be so bad that immediate transfusion of o negative is required to save their life - the lab will do a full cross match when things have stabilised.

That's a different functional requirement, which is duly catered for.

We do get liars though - bizarre really, they jeopardise their own treatment for some strange reason.  In this instance, the person is fully functional (otherwise how could they lie?) and deliberately providing false data.  Obviously this is not intended to cast aspersions on those folk who are mentally ill, and may be incapable of telling wrong from right.

Regards

T

Red Flag This Post

Please let us know here why this post is inappropriate. Reasons such as off-topic, duplicates, flames, illegal, vulgar, or students posting their homework.

Red Flag Submitted

Thank you for helping keep Tek-Tips Forums free from inappropriate posts.
The Tek-Tips staff will check this out and take appropriate action.

Reply To This Thread

Posting in the Tek-Tips forums is a member-only feature.

Click Here to join Tek-Tips and talk with other members! Already a Member? Login

Close Box

Join Tek-Tips® Today!

Join your peers on the Internet's largest technical computer professional community.
It's easy to join and it's free.

Here's Why Members Love Tek-Tips Forums:

Register now while it's still free!

Already a member? Close this window and log in.

Join Us             Close