Entry tags:
database design question
Dear LJ Brain Trust,
I'm too new to this to know if this is a "DB 101" question or more interesting, but Googling isn't getting me anywhere and I'm not getting a clear understanding from casual interrogation of coworkers, so I'm trying y'all.
Let's say I have a dataset that has half a dozen(ish) types of data -- posts, and users, and comments, and a few others. Logically, each of these types is a table, and I can join as necessary across them (to, say, find all posts by a given user). That's fine.
Now suppose I have, oh, 250 such datasets (same schemas, different data), and I'm going to load them all into a single database because I want to be able to do queries across datasets. There are two ways I might approach this:
1. Make 250 sets of tables. Each table is named for its dataset (e.g. Set1.Posts, Set1.Users, ... Set250.Posts, Set250.Users). If I want to do a query across all the users, everywhere, I, um... can I do the moral equivalent of "JOIN *.Posts"? How would that work? (Rough sketch of what I'm imagining below, after option 2.)
2. Make big tables. I have one Posts table, and one Users table, and so on; at data load I cons up values for a new "dataset" column to say where each row came from. If I only want to look at one site I use a WHERE on that column to restrict the SELECT; if I want everything, it's all there for the taking.
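To make the two options concrete, here's roughly what I picture the queries looking like. (I'm guessing at column names like Id, OwnerUserId, and DisplayName, and for #1 I'm assuming the cross-dataset query ends up being a big UNION ALL, since I don't know of a real "JOIN *.Posts".)

    -- #1: "all the users, everywhere" has to be stitched together by hand (or by a script)
    SELECT 'Set1' AS dataset, u.* FROM Set1.Users u
    UNION ALL
    SELECT 'Set2' AS dataset, u.* FROM Set2.Users u
    -- ...and so on for all 250 sets; same story for any join that spans datasets

    -- #2: one big table, so the dataset column does the work
    SELECT * FROM Users WHERE dataset = 'Set1';   -- one site
    SELECT * FROM Users;                          -- everything
    -- joins just carry the dataset column along, e.g.
    SELECT p.Id, u.DisplayName
      FROM Posts p
      JOIN Users u ON u.dataset = p.dataset AND u.Id = p.OwnerUserId;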
Let me dispense with one possible concern: these datasets are updated from time to time, so at times I'll want to refresh the data from a dataset because an update came out. (An update is a new, complete copy, not a delta.) My database supports data partitioning, so deleting just the rows from that dataset before loading in a fresh copy is not a problem. (In option 1, you just drop the table.)
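Concretely, I picture a refresh going something like this (Set42 and staging_posts are made-up names, and the exact bulk-load syntax will depend on the database):

    -- #2: wipe just that dataset's rows, then load the fresh copy
    DELETE FROM Posts WHERE dataset = 'Set42';
    INSERT INTO Posts SELECT 'Set42', s.* FROM staging_posts s;  -- assumes dataset is the first column

    -- #1: the whole per-dataset table just gets dropped and rebuilt
    DROP TABLE Set42.Posts;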
It seems to me that there is everything to gain and (probably?) nothing to lose by doing #2, if my database can handle that many rows. Because my database is in the "big-data analytics" space, I believe it can handle that. I mean, this database can operate on petabytes of data; my li'l pile of datasets will be well under a terabyte. (In fact, #1 didn't even occur to me, until I mentioned this project in passing and somebody suggested it.)
Is that reasonable, or am I missing something big and obvious? What factors should I be considering that I haven't mentioned? And if I'm wrong and I should be doing #1, how do I write a JOIN that crosses all the datasets?
(All the Stack Exchange data, in case you're wondering.)
no subject
But your comment prompts a question that I want to double-check: users have globally-unique IDs, but other things, like posts, don't. I have been assuming that it's ok if my "posts" table has 250 "post #1"s in it, because there'll also be a column for site and in combination these provide unique identification. A DB table doesn't need to have a single column with unique values, does it? Or am I wrong and I'm also going to need to cons up a global ID (post ID + site) for each row?
no subject
I understand there are RDBMSs, MySQL I think, that do require primary keys on all tables, but it's not, IIRC, ANSI standard. Grownup SQLs (Oracle, PostgreSQL) don't require primary keys.
In such SQLs, there's nothing preventing you from having a table that consists of a single column (or N columns) and 100 rows (or M rows) all of which are set to the value "rutabaga". Or NULL.
Some RDBMSs (PostgreSQL) have a hidden unique auto-assigned OID column you can use in a pinch.
Some RDBMSs allow you to define a multi-column unique key which does exactly what you want here: it's a constraint which says the combination of the value of fieldA and value of fieldB must be unique across the table.
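For example (column names invented), in most dialects that's just:

    ALTER TABLE Posts ADD CONSTRAINT posts_site_postid_uniq UNIQUE (site, post_id);
    -- or declared up front in CREATE TABLE: PRIMARY KEY (site, post_id)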
Or you can, yes, add a column with the two values concatenated, if it makes you and/or the RDBMS you're using happy, though it's considered bad normalization.
no subject
I don't think my queries care about unique keys, so unless the DB does (and doesn't supply its own OID column), I don't think I need to do something as ugly as fabricating that new column. I'll check to see if a multi-column unique key is a feature in this one.
no subject
Such a schema does no wrong in the sight of Codd, but some ORMs require the primary key to be a single field.
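(If one of those ORMs ever shows up, the usual escape hatch is a surrogate key; a PostgreSQL-flavored sketch, table name invented:)

    ALTER TABLE Posts ADD COLUMN row_id BIGSERIAL PRIMARY KEY;  -- assumes no existing primary key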