Entry tags:
database design question
Dear LJ Brain Trust,
I'm too new to this to know if this is a "DB 101" question or more interesting, but Googling isn't getting me anywhere and I'm not getting a clear understanding from casual interrogation of coworkers, so I'm trying y'all.
Let's say I have a dataset that has half a dozen(ish) types of data -- posts, and users, and comments, and a few others. Logically, each of these types is a table, and I can join as necessary across them (to, say, find all posts by a given user). That's fine.
Now suppose I have, oh, 250 such datasets (same schemas, different data), and I'm going to load them all into a single database because I want to be able to do queries across datasets. There are two ways I might approach this:
1. Make 250 sets of tables. Each table is named for its dataset (e.g. Set1.Posts, Set1.Users, ... Set250.Posts, Set250.Users). If I want to do a query across all the users, everywhere, I, um... can I do the moral equivalent of "JOIN *.Posts"? How would that work? (I'll sketch both options below.)
2. Make big tables. I have one Posts table, and one Users table, and so on; at data load I cons up values for a new "dataset" column to say where each row came from. If I only want to look at one site I use a WHERE on that column to restrict the SELECT; if I want everything, it's all there for the taking.
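To make those concrete, here's roughly what "all posts by a given user, within one dataset" would look like under each option. (Column names are invented for illustration; they're not necessarily what's in the real data.)

    -- Option 1: one set of tables per dataset
    SELECT p.*
    FROM Set1.Posts p
    JOIN Set1.Users u ON u.Id = p.OwnerUserId
    WHERE u.DisplayName = 'somebody';

    -- Option 2: one big table per type, with a dataset column.
    -- Note the join has to match on dataset too, since user IDs
    -- presumably repeat across datasets.
    SELECT p.*
    FROM Posts p
    JOIN Users u ON u.Id = p.OwnerUserId
                AND u.dataset = p.dataset
    WHERE p.dataset = 'set1'
      AND u.DisplayName = 'somebody';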
Let me dispense with one possible concern: these datasets are updated from time to time, so at times I'll want to refresh the data from a dataset because an update came out. (An update is a new, complete copy, not a delta.) My database supports data partitioning, so deleting just the rows from that dataset before loading in a fresh copy is not a problem. (In option 1, you just drop the table.)
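For concreteness, the refresh under each option is something like this (partition-maintenance syntax varies by vendor, so take the second one as the portable version):

    -- Option 1: just drop (and later recreate) that dataset's tables
    DROP TABLE Set17.Posts;

    -- Option 2: clear out just that dataset's rows before reloading;
    -- with partitioning on the dataset column this can be a cheap
    -- partition drop instead of a row-by-row delete
    DELETE FROM Posts WHERE dataset = 'set17';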
It seems to me that there is everything to gain and (probably?) nothing to lose by doing #2, if my database can handle that many rows. Because my database is in the "big-data analytics" space, I believe it can handle that. I mean, this database can operate on petabytes of data; my li'l pile of datasets will be well under a terabyte. (In fact, #1 didn't even occur to me until I mentioned this project in passing and somebody suggested it.)
Is that reasonable, or am I missing something big and obvious? What factors should I be considering that I haven't mentioned? And if I'm wrong and I should be doing #1, how do I write a JOIN that crosses all the datasets?
(All the Stack Exchange data, in case you're wondering.)
no subject
If you do #1 anyway, the way you write a query across multiple tables like that would be with UNION. UNION is almost but not quite the GOTO of SQL, and having to use it is a pretty reliable indicator that something has gone wrong in your life. And you would have to use it 249 times to query across 250 instances of the same table. So if I had made some terrible decision that landed me in the situation of having to do this, I would totally automate the generation of that UNION-intensive query in some third language (perl, PHP, an MSWord mail-merge, anything). Also, once I had generated the statements, I'd use them as the bases of views, so I'd never have to look at them again.
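The generated output would look something like this (truncated; the script would emit all 250 branches), synthesizing a dataset column so the view looks just like option #2:

    CREATE VIEW all_posts AS
        SELECT 'set1' AS dataset, p.* FROM Set1.Posts p
        UNION ALL
        SELECT 'set2' AS dataset, p.* FROM Set2.Posts p
        UNION ALL
        -- ...and so on through Set250...
        SELECT 'set250' AS dataset, p.* FROM Set250.Posts p;

(UNION ALL rather than plain UNION, note: plain UNION also deduplicates the combined rows, which you don't need here and really don't want to pay for across 250 tables.)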
ETA: A third option is to use table inheritance, if your RDBMS supports it, but it's not ANSI standard SQL so it might not. In table inheritance, you make a parent table such as "users" with a column in there for the dataset, and then you have 250 child tables "users_0001", "users_0002", etc., where for every record the dataset column is set to "0001" or whatever. You can then query across all users by querying the parent users table, but do all your inserts, updates, and deletes against the individual child tables. That might be more convenient.
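In PostgreSQL flavor it looks roughly like this (from memory, so treat it as a sketch rather than gospel):

    CREATE TABLE users (
        id      integer,
        dataset text
        -- ...plus the rest of the shared columns
    );

    -- Each child inherits the parent's columns; the CHECK constraint
    -- documents which dataset each child holds.
    CREATE TABLE users_0001 (CHECK (dataset = '0001')) INHERITS (users);
    CREATE TABLE users_0002 (CHECK (dataset = '0002')) INHERITS (users);

    SELECT count(*) FROM users;       -- sees the parent plus all children
    SELECT count(*) FROM ONLY users;  -- ONLY restricts to the parent table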
ETA2: It's been almost a decade and two major versions of PostgreSQL since I last worked with table inheritance, so I may be missing some crucial details. If you think it might be for you, look into it in detail to make sure your RDBMS supports it sufficiently and without any gotchas. There were definitely important gotchas in PostgreSQL 7.4.x's table inheritance support.
no subject
Exactly. In terms of what you can express, #2 is strictly more powerful; any query against a list of #1's tables can be mechanically transformed into a WHERE clause for querying #2, but not the other way around. And #2 can do a "GROUP BY dataset" to compute over each of the subsites and, say, order them by trollitude.
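i.e., something like this, where flag_count is a made-up stand-in for whatever trollitude metric you have:

    -- One pass over the big table, one output row per dataset
    SELECT dataset, avg(flag_count) AS trollitude
    FROM posts
    GROUP BY dataset
    ORDER BY trollitude DESC;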
Sometimes performance trumps cleanliness, but it doesn't sound like the person suggesting #1 is exactly an expert in your database's performance characteristics. If your system needs you to manually ask for an index or something to make #2 fast, you'll learn. :)
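By "ask for an index" I mean something as simple as:

    -- Hypothetical: speeds up WHERE dataset = '...' filters,
    -- if your database uses conventional indexes at all
    CREATE INDEX posts_dataset_idx ON posts (dataset);

(A lot of the columnar big-data systems don't do conventional indexes and handle this some other way, so check what yours wants.)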
no subject
If I'd inherited situation number 1, I might consider creating a view over all those billions of tables so they look like solution 2, while still keeping the data-frobbing benefits of number 1. With enough horsepower the inefficiency might not matter.
no subject
I guess in this case it doesn't matter, but #2 also forces unifying the schemas of all the datasets. #1 with differing dataset schemas would be slightly insane to write queries for.
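(The usual trick when they don't unify cleanly is to pad the missing columns with NULLs; a hypothetical sketch:

    -- Suppose set A's Posts has a Score column and set B's doesn't
    SELECT 'a' AS dataset, Id, Score         FROM SetA.Posts
    UNION ALL
    SELECT 'b' AS dataset, Id, NULL AS Score FROM SetB.Posts;

Workable for one or two mismatches, insanity-confirming at 250.)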