Lemmy Server Performance

423 readers

1 users here now

lemmy_server uses the Diesel ORM that automatically generates SQL statements. There are serious performance problems in June and July 2023 preventing Lemmy from scaling. Topics include caching, PostgreSQL extensions for troubleshooting, Client/Server Code/SQL Data/server operator apps/sever operator API (performance and storage monitoring), etc.

founded 1 year ago

MODERATORS

[email protected]

post inclusion, a solid WHERE clause filter before any JOIN on SELECT post (listing of posts) (lemmy.ml)

submitted 1 year ago by [email protected] to c/[email protected]

20 comments fedilink hide all child comments

you are viewing a single comment's thread
view the rest of the comments

[–] [email protected] 1 points 1 year ago (1 children)

Again though, that sounds like something that can be indexed. And posts more than a few days old usually won't receive new voted very often. I think Reddit may archive very old posts so they can't receive be votes at all. Are these table scans only an issue when the person is trying to browse by best of all time?

For recent posts, yeah, a bit of buffering could help if votes are arriving very fast. That seems like an eventual good optimization. For now there is not enough traffic to need it, I'd expect.

[–] [email protected] 1 points 1 year ago (1 children)

Again though, that sounds like something that can be indexed.

I don't get what you are suggesting. There are INDEX.

The problem being addressed is that there is no WHERE clause that actually limits the posts.

JOIN is done on a table without first eliminating rows... and that worked OK when there was only 50,000 posts in the database, but now that it is over 1 million rows - it is causing major performance problems.

[–] [email protected] 2 points 1 year ago* (last edited 1 year ago) (1 children)

Sorry about the slow response. What I mean is, suppose you have a column with 1000 integers (27,5,100,60,...). You want to print the top 10:

SELECT num FROM xyz ORDER BY num DESC LIMIT 10;

The db has to scan all 1000 rows to find the 10 biggest numbers. Now suppose instead there is an index on that column, i.e. a btree that lets you search for a value with very few operations, or traverse the list in order. Now the SELECT doesn't have to examine all the rows. It only has to traverse 10 items from the index, starting at the large end. It does mean that UPDATE and INSERT operations for that columb become more expensive, since the index has to be updated too, but that too is less expensive than a table scan.

I'm saying that by having similar indexes on the possible sorting orders of read queries, you can likewise get rid of all the table scans. Does that make sense?

Similarly if you JOIN two indexed fields, that is like merging two sorted lists. The db can traverse both indexes in parallel to find the matching values. Db's can be very clever about stuff like this. It helps though if you use EXPLAIN to make sure they are doing the right thing.

[–] [email protected] 1 points 1 year ago (1 children)

It makes sense, but there are indexes.

As the subject of the post says... it is JOIN behavior that's the problem. The queries work perfectly fine when you ask for posts without doing JOIN to a bunch of empty tables.

[–] [email protected] 1 points 1 year ago (1 children)

Hmm, ok, something weird and pg specific might be going on. JOIN to an empty or almost empty table (I guess you mean outer join) sounds surpising but I'd hope the query planner can still do something reasonable. Anyway I don't feel like I'm being helpful at this point, so I'll stay out of your way. I'll be interested to know how it goes though.

[–] [email protected] 1 points 1 year ago (1 children)

i’d hope the query planner can still do something reasonable.

PostgreSQL specifically guards against queries with more than 8 joins... and Lemmy plows right past that.

[–] [email protected] 1 points 1 year ago (1 children)

What can I say, that sounds suspicious both from the PG side (complex queries with lots of joins are sometimes useful, such as for reporting) and on the Lemmy side (executing such queries in response to routine web requests is a pretty bad smell). It's still early days so this seems like a better time to re-examine the schema and migrate if necessary, than after waiting until there's a ton more data and activity.

[–] [email protected] 1 points 1 year ago* (last edited 1 year ago) (1 children)

It’s still early days

Lemmy has been on GitHub since February 2019, over four years. It isn't new at all. Several instances go way back.

The answer is: ORM.

[–] [email protected] 2 points 1 year ago

I don't mean the code is new, I mean the user base and data corpus are small compared to what we are hoping for. You're probably right about the ORM. :/