database design question
Oct. 20th, 2014 11:17 pm

Dear LJ Brain Trust,
I'm too new to this to know if this is a "DB 101" question or more interesting, but Googling isn't getting me anywhere and I'm not getting a clear understanding from casual interrogation of coworkers, so I'm trying y'all.
Let's say I have a dataset that has half a dozen(ish) types of data -- posts, and users, and comments, and a few others. Logically, each of these types is a table, and I can join as necessary across them (to, say, find all posts by a given user). That's fine.
Now suppose I have, oh, 250 such datasets (same schemas, different data), and I'm going to load them all into a single database because I want to be able to do queries across datasets. There are two ways I might approach this:
1. Make 250 sets of tables. Each table is named for its dataset (e.g. Set1.Posts, Set1.Users, ... Set250.Posts, Set250.Users). If I want to do a query across all the users, everywhere, I, um... can I do the moral equivalent of "JOIN *.Users"? How would that work? (I'll sketch both options right after this list.)
2. Make big tables. I have one Posts table, and one Users table, and so on; at data load I cons up values for a new "dataset" column to say where each row came from. If I only want to look at one site I use a WHERE on that column to restrict the SELECT; if I want everything, it's all there for the taking.
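To make both options concrete, here's roughly what I'm picturing. Table and column names are made up, and as far as I can tell the only way I know to get "everything" under option 1 is a big UNION ALL, which is part of why I'm asking:

    -- Option 1: per-dataset tables; querying everything means naming them all.
    SELECT 'Set1' AS dataset, Id, DisplayName FROM Set1.Users
    UNION ALL
    SELECT 'Set2', Id, DisplayName FROM Set2.Users
    UNION ALL
    -- ...and so on, 248 more times...
    SELECT 'Set250', Id, DisplayName FROM Set250.Users;

    -- Option 2: one big Users table with an extra "dataset" column.
    SELECT Id, DisplayName FROM Users WHERE dataset = 'Set1';   -- one site
    SELECT dataset, Id, DisplayName FROM Users;                 -- everything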
Let me dispense with one possible concern: these datasets are updated from time to time, so at times I'll want to refresh the data from a dataset because an update came out. (An update is a new, complete copy, not a delta.) My database supports data partitioning, so deleting just the rows from that dataset before loading in a fresh copy is not a problem. (In option 1, you just drop the table.)
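Concretely, I'm imagining the refresh as something like this (same made-up names as above):

    -- Option 2: throw away just that dataset's rows, then bulk-load the fresh copy.
    DELETE FROM Users WHERE dataset = 'Set42';

    -- Option 1: the table *is* the dataset, so it's just drop, recreate, reload.
    DROP TABLE Set42.Users;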
It seems to me that there is everything to gain and (probably?) nothing to lose by doing #2, if my database can handle that many rows. Because my database is in the "big-data analytics" space, I believe it can handle that. I mean, this database can operate on petabytes of data; my li'l pile of datasets will be well under a terabyte. (In fact, #1 didn't even occur to me, until I mentioned this project in passing and somebody suggested it.)
Is that reasonable, or am I missing something big and obvious? What factors should I be considering that I haven't mentioned? And if I'm wrong and I should be doing #1, how do I write a JOIN that crosses all the datasets?
(All the Stack Exchange data, in case you're wondering.)
(no subject)
Date: 2014-10-21 03:52 am (UTC)
Oh, and if I were going to implement a parent table with 250 child tables, I'm pretty confident I would be automating the generation of something, possibly in some other language, possibly in pl/pgsql, possibly doing something creative with trigger functions. I'm pretty sure you can create new tables in trigger functions, so....
ETA: No worries! Now that I think of it, heck, in PostgreSQL, you could have a trigger function ON INSERT on the "users" parent table, so you could just attempt to INSERT what should be "users_0014" right into "users", and the trigger function looks at the "dataset" column, sees "0014", checks and sees there's no "users_0014" child table and makes it, and inserts the data there instead.
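Very rough sketch of what I mean, untested and off the top of my head (it assumes the "dataset" column is text-ish, and it ignores schemas, indexes on the children, and concurrent loaders):

    CREATE OR REPLACE FUNCTION users_route_insert() RETURNS trigger AS $$
    DECLARE
        child text := 'users_' || NEW.dataset;   -- e.g. users_0014
    BEGIN
        -- First time we see this dataset, create its child table.
        IF NOT EXISTS (SELECT 1 FROM pg_class WHERE relname = child) THEN
            EXECUTE format('CREATE TABLE %I () INHERITS (users)', child);
        END IF;
        -- Put the row into the child instead of the parent.
        EXECUTE format('INSERT INTO %I SELECT ($1).*', child) USING NEW;
        RETURN NULL;   -- cancel the insert into "users" itself
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER users_route_insert_trg
        BEFORE INSERT ON users
        FOR EACH ROW EXECUTE PROCEDURE users_route_insert();

(For 250 datasets you could also just pre-generate the 250 CREATE TABLE statements with a script, per the above; this is the lazy on-demand version.)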
(no subject)
Date: 2014-10-22 01:11 am (UTC)
Yeah, agreed. Since I control where the data gets downloaded, I can control file/directory names and then script it. :-)
Thanks for the tips on parent tables and trigger functions! I don't know what all that means yet, but I now have some vocabulary with which to go looking.