[Leaplist] coding - rdbms, revisions, and fks
John Simpson
jms1 at jms1.net
Fri Dec 4 01:52:40 EST 2009
On 2009-12-03, at 2241, Phil Barnett wrote:
>
> You can certainly do this, but it's wasteful.
>
> All you really need in the history is:
>
> 1. Original records that generated the mailing that don't change. If you
> need to edit them, save a new different copy with a new differnent id. Link
> the two records together so you know they are cousins.
>
> 2. History tables with dates and pointers to the original records.
>
> This consumes less space.
>
> Duplicating data that is already in the system is wasteful if you can refer
> to an original copy of it.
one, disk space is cheap these days.
two, you are correct- it's nowhere near an optimal design. however it does separate the archives from the live data, and it allows reports to be run based entirely on the history table, which leaves the live tables free for edits, since you won't have to worry about locking issues slowing things down.
although now you've got me thinking, and that can be a dangerous thing.
so here's another idea.
in the tables where values might change (i.e. person and address) have a second integer field called "rev", and make the table's primary key the combination of the ID and rev fields. whenever an entity is first created, that's "rev 1" for that person/address. when an entity is changed, insert a new record with the same ID value and "rev 2". the next change would be rev 3, and so forth. (personally i would use rev values starting from zero, but that's just me.)
each new record would contain all of the same data from its predecessor, with changes made as appropriate- so that each record can potentially be used by itself (i.e. if you fix a typo, you don't want "rev 2" to have ONLY the new corrected spelling of the city, with the other fields empty to reflect the concept of "use the data from the previous version". that's just asking for trouble.)
obviously you'll have to modify some of your existing code.
in the code which allows users to edit records, instead of sending a SELECT statement, you would need to send an INSERT statement using the existing data, substituting new values as needed, and with the rev number incremented. (this also means that these tables would only need SELECT and INSERT permissions for the app user, since there would be no more UPDATE.)
in the code which actually does a mailing... i'm sure your existing SELECT query to get the address data for doing a mailing is complex enough as it is, so here's a way which is easier to implement and will probably run faster than trying to force the SQL server to only return those records which have the highest rev for that aid value...
first start a transaction, to ensure that the data doesn't change between the two queries.
then send the first query, which gets the highest revs assiciated with each entity. store these values in memory. (i'm not sure what language you're using, but i would almost certainly be using perl, and this is a perfect job for a hash.) the query would look like this:
SELECT aid,max(rev) FROM addresses GROUP BY aid
then send your existing query to find the addresses which are eligible, but add the "rev" field to the list of fields you're SELECTing. then, as you step through the records which come back from the query, if the rev value is not the same as the value you have in memory, ignore that record (because the one with the highest rev is presumably in the result set as well.)
as you add records to your history table, instead of storing the full contents of the customer name and address fields, you store the ID and rev values you ended up using.
once you're done processing records, release the transaction. it shouldn't matter if you use COMMIT or ROLLBACK, since you will not have changed any data anyway.
this approach is less wasteful of space, but it makes certain kinds of post-send queries more complicated. for example, how many mail pieces did you send to zip code 23455 in november? the history table has every piece which was sent, but because the actual addresses are stored in the address table, you have to link in the address table by (aid,rev) in order to be able to select on the zip code. the query might look like this:
SELECT COUNT(*) FROM mailing , address
WHERE mailing.mid = address.mid AND mailing.rev = address.rev
AND mailing.started >= '2009-11-01' AND mailing.started < '2009-12-01'
AND address.zip LIKE '23455%'
i'm sure there's a better way to express this query, using some variant of JOIN, but it's been a while since i needed to use any JOIN clauses in a query, so i'm not as familiar with them as i probably should be. but you can see the idea- the first two WHERE clauses serve to link the tables, and the other three are the criteria to actually select records.
----------------------------------------------------------------
| John M. Simpson --- KG4ZOW --- Programmer At Large |
| http://www.jms1.net/ <jms1 at jms1.net> |
----------------------------------------------------------------
| http://video.google.com/videoplay?docid=-1656880303867390173 |
----------------------------------------------------------------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PGP.sig
Type: application/pgp-signature
Size: 194 bytes
Desc: This is a digitally signed message part
Url : http://lists.leap-cf.org/pipermail/leaplist/attachments/20091204/30b9bb79/PGP.bin
More information about the Leaplist
mailing list