
performance problem when counting records when count is more than 10 million

(OP)
I have a performance problem.

I have two tables: one is master data, the other is transaction data, and the transaction table has more than 10 million records.

I need to get the count of transaction records whose key field is in the master table. Please suggest the best solution. I have tried joins and IN ... NOT IN conditions. If you have any way out, let me know.

RE: performance problem when counting records when count is more than 10 million

The solution to this problem can be summarised in one word: indexes. As a minimum, you should have an index tag on the transaction table's foreign key, that is, the field in the transaction table that points back to the master table. This is usually something like a customer ID (if the transactions are invoices or orders), or something similar. I assume you already have indexes on the two tables' primary keys.

If you don't already have such a tag in place, you will notice a huge difference in performance when you add it. But you can go a bit further. If at all possible, make the key fields (the foreign key mentioned above and all primary keys) as short as possible. An integer data type is much more efficient than a long character string or a multi-digit number, for example.
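
Creating such tags might look like this (a minimal sketch with hypothetical table and field names; INDEX ON needs the table opened exclusively):

CODE -->

USE Master EXCLUSIVE
INDEX ON MasterID TAG MasterID   && primary key of the master table

USE Transactions EXCLUSIVE
INDEX ON MasterID TAG MasterID   && foreign key pointing back to Master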

Mike



__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads

RE: performance problem when counting records when count is more than 10 million

Quote:

I need to get the count of transaction records whose key field is in the master table.

Sorry, I just noticed that part of your question.

You will need something like this:

CODE -->

SELECT COUNT(TransID) FROM Transactions ;
  WHERE MasterID IN (SELECT MasterID FROM Master) 

where MasterID is the foreign key in Transactions (e.g. CustomerID), matching the primary key of the Master table, and TransID is the transaction table's own key.

But you will still need index tags (on the two IDs) to get acceptable performance.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads

RE: performance problem when counting records when count is more than 10 million

(OP)
My query is: SELECT COUNT(*) FROM t1 WHERE t1.a + t1.b IN (SELECT t2.a FROM t2)

where t1 is the transactional data, t2 is the master, t1 has indexes on the a and b fields, and t2 has an index on the a field.

My only problem is that I need the count of t1 data that matches t2.

My query takes 12 seconds on 6 lakh (600,000) records, and I am going to execute it on 12 million records; think of the performance.

RE: performance problem when counting records when count is more than 10 million

The answer is - without a doubt - indexes. If you post your table structure and show exactly what you want out of the child table, that would be the most efficient way to come up with an actual solution.

Larry - TPG

RE: performance problem when counting records when count is more than 10 million

Are t1.a and t1.b character fields? If so, it might be more efficient to have a single index on a + b rather than separate indexes on a and b. It might be worth experimenting.

Also, make sure that a and b are as small as possible. The time taken to retrieve records using an index is proportional to the length of the index key. If you can reduce the size of the two character fields, you will see an improvement.
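
The compound tag could be created like this (a sketch; run it with the transaction table opened exclusively):

CODE -->

USE t1 EXCLUSIVE
INDEX ON a + b TAG ab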

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads

RE: performance problem when counting records when count is more than 10 million

(OP)
Mike, both are character fields.

Earlier I used to get the record count using RECCOUNT() minus the deleted records count, which was very fast. Now I have some orphan records in the transactional data which are creating a problem, and I need to join the table with the master data.

RE: performance problem when counting records when count is more than 10 million

Well, what I said about having an index on a + b still stands (as does making the fields as short as possible).

But what exactly is your goal? Is it to find (and perhaps delete?) the orphaned records? If so, it would be much more efficient to find records from t1 that are NOT in t2.

Or are you trying to perform some query on the transaction data, but you are worried that the orphaned records will muck up the results in some way? If so, a normal INNER join should solve that: selecting from t2 inner joined to t1 will exclude the orphans. A sketch of both follows below.
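
For example (a sketch based on the table and field names already used in this thread):

CODE -->

* Count the orphans: t1 rows whose combined key has no match in t2
SELECT COUNT(*) FROM t1 ;
   WHERE t1.a + t1.b NOT IN (SELECT t2.a FROM t2)

* Count only the matching rows, excluding orphans via an inner join
SELECT COUNT(*) FROM t1 ;
   INNER JOIN t2 ON t2.a = t1.a + t1.b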

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads

RE: performance problem when counting records when count is more than 10 million

(OP)
I need the correct count of the data in the transactional table that matches the data in the master table. I hope you understand.

The orphan records are not fully orphaned: they have a value in the key field, but that value is not in the master table.
In short, I need RECCOUNT() of the transactional data that matches the master table, with good performance.

I tried creating indexes. I tried joins. But I failed to get good performance.

RE: performance problem when counting records when count is more than 10 million

Are you familiar with Rushmore? Are you using it? It is hard to be more specific without seeing your structure and knowing exactly what you want, but Rushmore is the answer to your problem.

RE: performance problem when counting records when count is more than 10 million

Hi Srinuvelati,

You keep hammering on the orphaned data, but Mike understood that; it has nothing to do with your performance problem. An index on a+b of the t1 table makes VFP use that index to look up whether that compound key exists in the t2 master table.

Separate indexes on a and b don't really help speed up the lookup process. You need an index on a+b, like here:

CODE

Create Cursor t1 (a char(4), b char(4))
Index on a+b tag ab

Create Cursor t2 (a char(8))

* group 1
Insert into t2 values ('aaaabbbb')
Insert into t1 values ('aaaa','bbbb')
Insert into t1 values ('aaaa','bbbb')

* orphaned case
Insert into t1 values ('cccc','dddd')

* group 2
Insert into t2 values ('eeeeffff')
Insert into t1 values ('eeee','ffff')

Select count(*) from t1 inner join t2 on t2.a = t1.a+t1.b
Select t1.a, t1.b, count(*) from t1 inner join t2 on t2.a = t1.a+t1.b group by t1.a, t1.b 

Rushmore optimization reporting shows that the index tag ab is used; separate indexes are not used for join optimization.
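
To see this for yourself, you can switch on VFP's Rushmore optimization report with SYS(3054) before running the query:

CODE

SYS(3054, 11)   && report full Rushmore info, including join optimization
Select count(*) from t1 inner join t2 on t2.a = t1.a+t1.b
SYS(3054, 0)    && switch the report off again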

Bye, Olaf.

Olaf Doschke Software Engineering
https://www.doschke.name

RE: performance problem when counting records when count is more than 10 million

And just to be clear, this will be a speed-up compared with the case without the necessary index, but it is a join optimization, no wonder. It does not turn the query into a single index seek; if you expect this to give you a record count as fast as RECCOUNT(), you're expecting too much.

If you need this count very often and fast, the usual solution is to maintain your own record counters: a metadata table you update with each insert, update, and delete. Yes, this has the risk of becoming wrong if not all code also maintains this metadata table, so you would perhaps put the maintenance into insert/update/delete triggers.
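
Such a counter could look like this (a rough sketch with hypothetical names; the procedure would live in the DBC's stored procedures and be wired up via CREATE TRIGGER, and the tables must belong to a database for triggers to work):

CODE

* Metadata table holding the maintained count
CREATE TABLE counters (tablename C(30), reccnt I)
INSERT INTO counters VALUES ('t1', 0)

* Attach to the transaction table, e.g.:
* CREATE TRIGGER ON t1 FOR INSERT AS UpdCounter('t1', 1)
* CREATE TRIGGER ON t1 FOR DELETE AS UpdCounter('t1', -1)

PROCEDURE UpdCounter(tcTable, tnDelta)
   UPDATE counters SET reccnt = reccnt + m.tnDelta ;
      WHERE ALLTRIM(tablename) == m.tcTable
   RETURN .T.
ENDPROC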

Bye, Olaf.

Olaf Doschke Software Engineering
https://www.doschke.name

RE: performance problem when counting records when count is more than 10 million

Hi Srinuvelati,

Believe it or not, the best response to your question will come when you reply to our question:

If you post your table structure and show exactly what you want out of the child table, that would be the most efficient way to come up with an actual solution.

Why do you simply ignore it?

Koen

RE: performance problem when counting records when count is more than 10 million

In addition to posting your table structure (including all the index expressions), it would be helpful to know whether this is a once-off operation (for example, to get rid of orphaned transactions that are present by accident) or something you need to do on a regular basis.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads

RE: performance problem when counting records when count is more than 10 million

Hi Srinuvelati,

Mike is, of course, quite correct. Please do not forget to answer that. Also, I would like to know what the current block size, SYS(2012), is.

Koen

RE: performance problem when counting records when count is more than 10 million

(OP)
Olaf,

The SQL below takes 12 seconds; the SQLs you provided take more time.

SELECT COUNT(*) FROM t1 WHERE t1.a + t1.b IN (SELECT a FROM t1) INTO CURSOR t2

t1 has 35 fields and the index fields' combined length is 11 characters; t2 is a cursor with only one field.

RE: performance problem when counting records when count is more than 10 million

Srinuvelati,

have you read anything we said?
My query is optimized by an index on a+b. Did you create that? Your measurement is meaningless until you create the index that the query needs to run fastest.

Bye, Olaf.

Olaf Doschke Software Engineering
https://www.doschke.name

RE: performance problem when counting records when count is more than 10 million

(OP)
I have already mentioned that I have an index on the a+b fields of the transactional data and on the a field of the master data.

RE: performance problem when counting records when count is more than 10 million

You said you have indexes on a and b in the transactional data, not that you have an index on a+b.

CODE -->

Index on a+b tag ab 

So you have such an index?

Bye, Olaf.

Olaf Doschke Software Engineering
https://www.doschke.name

RE: performance problem when counting records when count is more than 10 million

(OP)
Yes, I have that index. There is a good amount of data in the transactional table: 6 lakh (600,000) records.

RE: performance problem when counting records when count is more than 10 million

Srinuvelati, it would also be helpful if you could address the points I raised in my post above, timestamped 11 Dec 19 15:03.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads

RE: performance problem when counting records when count is more than 10 million

I am familiar with lakh; it's simply 1,00,000, with the comma placed differently than the Western grouping of every three digits. You already said it's 10 million records.

The quantity of data makes things slower. The time necessary rises in different ways: sometimes exponentially, sometimes quadratically, sometimes only logarithmically. The best you can expect is linear performance when you want to count everything, since this count isn't maintained in the DBF header the way RECCOUNT() is.

Anyway, the number of records has no bearing on which indexes are used for query optimization; that's a qualitative issue, and you need index tags on the expressions used to join or filter.

I think you won't get much faster than what you already have anyway.
I tried 10 million records, both with separate a and b fields and with the normal case of a foreign key that's identical to the main key.

It takes the same time for me, about 10 seconds.
You'll never get this down to less than 1 second. This is always a full scan through the data, no matter whether DBF or CDX.

Edit: Dropping all indexes, I get a duration of 15 seconds, so the indexing helps to speed up the join; but since a full count isn't a selective query, it doesn't profit as much.

If you deleted detail data together with main data, you could make use of VFP9's feature to better optimize queries regarding DELETED() and would not need to join the main data for the count. It obviously depends on your needs for the detail data; as you tell it, you keep orphaned detail data. I don't know your reasons for that; you haven't reacted much to the others' questions either.

Bye, Olaf.

Olaf Doschke Software Engineering
https://www.doschke.name

RE: performance problem when counting records when count is more than 10 million

Quote (Olaf)

I am familiar with lakh; it's simply 1,00,000, with the comma placed differently than the Western grouping of every three digits

I am too. But I thought it was 10^5 rather than 10^6 (I might be wrong). But it's true that the placement of commas is non-standard.

I first heard the term "lakh" in the 1960s, in Lionel Davidson's novel The Rose of Tibet, which involved a treasure worth 500 lakh rupees or something similar.

&& End of useless information

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads

RE: performance problem when counting records when count is more than 10 million

(OP)
I also feel that the performance of this cannot be increased. Is there any other way of finding how many orphans there are, by using a relation or something else? You are reporting 10 seconds for 10 million records; for me it is taking 10 seconds for 1 million records.

RE: performance problem when counting records when count is more than 10 million

Quote:

it is taking 10 seconds for 1 million records

But what we are saying is that the increase in time is not necessarily linear. Just because it takes 10 seconds for one million, that doesn't mean it will take 100 seconds for 10 million.

But even if it does take 100 seconds, that brings us back to my earlier question (which you still haven't answered): Is this a one-off requirement, or something that you will be doing regularly? If the former ... well, you could have finished the job by now. If the latter, what is the cost of keeping the system busy for a couple of minutes?

And another question: Why are these orphans getting into the transaction file in the first place? This is presumably something you want to avoid (or is it?). If that's right, perhaps you should be devoting your attention to how to prevent them.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads

RE: performance problem when counting records when count is more than 10 million

Well, my time may be better because of better hardware. I also don't have your table, only sample data with just the fields mentioned and nothing more, and my CDX has no further tags, etc.

That all plays into the overall time necessary, so any values are only valid for a certain system, not in general. It seems our systems differ by a factor of 10 in performance, in HDD and/or RAM. Or, despite saying so, you don't have the indexes necessary to speed things up a bit more, though you think you do. Anyway, you don't get a factor of 10 out of an index, per my experiments.

If you deleted the orphaned detail transactions, you wouldn't need to join; you could just count.

And Mike, yes, it is 10^5, hence the strange comma positions. But in the initial post the talk was about 10 million anyway.

Bye, Olaf.

Olaf Doschke Software Engineering
https://www.doschke.name

RE: performance problem when counting records when count is more than 10 million

There are still possibilities, because:

1. You could try to reindex the tables; recreating the indexes may remove bloat and make the index tags cleaner and faster.

2. You could set up the foreground/background memory with SYS(3050) (see the sketch after this list).
The typical Rushmore bitmap will have 10 million bits, as your table is that long and every record is represented by 1 bit: hits with 1, misses with 0, but still 1 bit per record. That's 10 Mbit = 1.25 MB. Not very much, but that's just for this query. The CDX I created, with just the one necessary index tag, is about 50 MB. You don't need to keep all of it in memory, but since you have no filter conditions other than the join condition, you'll go through all of it, and it likely helps to reserve more memory; why not 100 MB or 256 MB?

3. You could do it like VFP does, and better: maintain a counter. When you use RECCOUNT(), it actually just reads a count from the DBF header and relies on this being the correct record count. In fact, you get errors if that doesn't match the file size as of HEADER()+RECCOUNT()*RECSIZE() (+2 IIRC, but you get the idea).

RECCOUNT() counts deleted rows too; that's its weakness. But you can do better and, as I already suggested, use triggers to maintain the actual count of non-orphaned records, even if you don't cascade deletes. When a record of the main table is deleted, you can subtract as many records of the other table as refer to that record.
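
The memory tuning from point 2 could look like this (a small sketch; sizes are in bytes, and 256 MB is just an example value):

CODE

lnOldMem = Val(Sys(3050, 1))      && current foreground buffer memory
=Sys(3050, 1, 256*1024*1024)      && reserve 256 MB for this session
Select count(*) from t1 inner join t2 on t2.a = t1.a+t1.b
=Sys(3050, 1, lnOldMem)           && restore the previous setting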

Bye, Olaf.


Olaf Doschke Software Engineering
https://www.doschke.name

RE: performance problem when counting records when count is more than 10 million

(OP)
Mike,

I am transferring the FoxPro data to Oracle. I am facing this problem in the initial load, and I believe it will appear in the incremental loads too; I mean, it's not a one-time job. I have no idea where these orphans are getting added, as this application is 25 years old. I want to avoid these orphans every time I transfer. At the same time, I know the immediate problem would be resolved by deleting these records; however, I want to fix this type of issue for future transfers.

Olaf, I will check the SQL with increased memory.

RE: performance problem when counting records when count is more than 10 million

I don't know about Oracle, but with SQL Server you'd go through CSV with a format file, using the BCP tool.

In an incremental load you also won't have that count problem, will you? The biggest data migration I did going directly from DBF into MSSQL went in 10,000-record chunks using prepared SQL statements, partly because that doesn't need as much RAM and partly because SQL Server only handles a certain number of locks in parallel. You find a natural chunk size that works faster than going for all data at once or for a single row at a time. A rough sketch of the idea follows below.

Finding chunks does not necessarily mean you always have exactly N records in them; I queried certain ID ranges, so typically N was just the upper bound.
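
Such a chunked transfer might look roughly like this (a hedged sketch with a hypothetical DSN, target table, and numeric id field to define the ranges; the thread's t1 does not actually have such a field):

CODE

lnConn = SQLCONNECT("TargetDSN", "user", "password")   && hypothetical DSN
lcA = ""
lcB = ""
* Prepare one parameterized INSERT; SQLEXEC() with only the handle
* re-executes the prepared statement
SQLPREPARE(lnConn, "INSERT INTO tgt (a, b) VALUES (?m.lcA, ?m.lcB)")

SELECT MAX(id) FROM t1 INTO ARRAY laMax   && hypothetical id field
lnFrom = 0
DO WHILE lnFrom < laMax(1)
   SELECT a, b FROM t1 ;
      WHERE id > lnFrom AND id <= lnFrom + 10000 INTO CURSOR curChunk
   SELECT curChunk
   SCAN
      lcA = curChunk.a
      lcB = curChunk.b
      SQLEXEC(lnConn)   && send one row via the prepared statement
   ENDSCAN
   lnFrom = lnFrom + 10000
ENDDO
SQLDISCONNECT(lnConn)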

And last but not least: Why not delete the rows you want to purge during the migration process? If they are really orphaned and also of no use in the current frontend, then get rid of them, and also PACK.

Even if you get new orphans, because you have defined neither cascading deletions nor restrictions on master data deletion in referential integrity code, getting rid of data you don't need, even in a side process not integrated into the main application, can help.

Bye, Olaf.

Olaf Doschke Software Engineering
https://www.doschke.name

RE: performance problem when counting records when count is more than 10 million

And just for the sake of fun: once you do delete the orphaned rows, INDEX ON NOT DELETED() TAG deleted BINARY.
The index size now is less than 1.5 MB.

Then SELECT COUNT(*) FROM t1, as you don't need to join t2 anymore. VFP9 Rushmore uses binary indexes on DELETED(), besides also supporting filtered FOR DELETED() indexes and more (see the VFP9 help topic "Indexes Based on Deleted Records"). It breaks down to this on my computer:

Time to count ~10 million records with 5% deleted rows:

Without index: 2.4 seconds
With index: 0.012 seconds
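
Put together, the approach looks like this (a sketch following the commands named above; it assumes exclusive access to t1 and that the orphans really can be discarded):

CODE

Use t1 Exclusive
Delete From t1 Where t1.a + t1.b Not In (Select t2.a From t2)   && remove orphans
Index On Not Deleted() Tag deleted Binary   && VFP9 binary index
Set Deleted On
Select Count(*) From t1    && now optimized via the binary tag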

So that's the most promising way of fast counting in the single table. And even if you want to refrain from modifying your production data, for the migration you could, and perhaps even should, use a data copy, i.e. restore a backup, to work without being influenced by the current application usage of the data. On the copy you can even work with exclusive access, have your migration EXE on the file server, etc.

And then you can not only delete and not only index on DELETED(): you can also PACK, which makes it even better, because RECCOUNT() is then exactly the count.

Since you brought up the migration topic, I'm no longer sure whether this is really just about the count, or about any handling of data without high selectivity, i.e. boiling a result down to just a few records. Most importantly, that's not to be confused with having just one aggregate value: that is a small result set, but to get it you need to go through all the data. High selectivity means only addressing a small number of rows.

Well, from my migration experience I can only say that partitioning the data helps a lot, but I don't see why that would need such a join optimization. Aside from some small tables, when you go for the massive tables and migrate chunks of data, the most natural way in nested or hierarchically structured relationships is going from one head record to all its detail data, and not vice versa.

Bye, Olaf.

Olaf Doschke Software Engineering
https://www.doschke.name
