FS#9424 - db.lck storing pid

Attached to Project: Pacman
Opened by pajaro (pajaro) - Friday, 01 February 2008, 03:40 GMT
Last edited by Allan McRae (Allan) - Saturday, 09 February 2013, 05:10 GMT
Task Type Feature Request
Category General
Status Assigned
Assigned To Andrew Gregory (andrewgregory)
Architecture All
Severity Low
Priority Normal
Reported Version 3.1.1
Due in Version Undecided
Due Date Undecided
Percent Complete 0%
Votes 3
Private No

Details

Description:

Again, storing pid in /var/pacman/db.lck should be the way to be. Reasons why:
- Pacman would be able to handle automatically lock file removal without user interaction if db.lck was dead, but it would still be locked if it was alive.
- Locking systems i saw use to store pid.
- Apart from pacman, any other applications would be able handle pid.
- Removing lock file when pacman crashed takes a few seconds seconds. After some months in can get to one minute. Since we have thousands of arch users, it gives us thousands of minutes that could have been saved. Thousands of minutes is enough hours to have to have some meals. This meals equal to at least one chicken.

Save the chickens. Add pid to db.lck.

Jokes apart, what's wrong about storing pid in db.lck?
This task depends upon

Comment by Roman Kyrylych (Romashka) - Friday, 01 February 2008, 13:29 GMT
It was decided that db.lck should be stored in db.
Comment by pajaro (pajaro) - Friday, 01 February 2008, 13:59 GMT
Romashka, you didn't read me. I didn't say anything about the lock file location. I was talking about storing pacman pid (process id) in the file db.lck.

Also, I was wondering why good locking systems create symlinks to temporary files. I found why: the pid that pacman had when the db.lck file was created can be assigned to another process after a reboot, so the workaround for this is to create db.lck in /tmp and symlink it to /var/lib/pacman. This way nonroot users still can't lock pacman db, but after a reboot the lock file is gone.

So my question, again, is... why not to store pacman pid (process id) in db.lck (/var/lib/pacman/db.lck)?

Comment by Roman Kyrylych (Romashka) - Friday, 01 February 2008, 14:01 GMT
Ah, sorry, I've missed the pid part.
Comment by pajaro (pajaro) - Friday, 01 February 2008, 15:11 GMT
it's ok, it can happen to anyone :)
Comment by Dan McGee (toofishes) - Saturday, 02 February 2008, 15:58 GMT
They aren't the prettiest functions in the world, but take a look at _alpm_lckmk and _alpm_lckrm in lib/libalpm/util.c. It should be quite easy to make a patch for this change.
Comment by pajaro (pajaro) - Saturday, 02 February 2008, 18:50 GMT
let's see what i can do :)
Comment by pajaro (pajaro) - Sunday, 03 February 2008, 02:29 GMT
I have it working.

Now the function _alpm_lckmk can return 2 error values, -1 and -2.
-1: "unable to lock database": there was a error creating the lock file. It still has associated the error message PM_ERR_HANDLE_LOCK, so localizations have to update it and remove all the "if you are sure pacman is not running..." tail, since it is not needed anymore.
-2: "another instance is running". It has a new error message, PM_ERR_INSTANCE_RUNNING.

In the attached tar.gz there are the diffs for "util.c", "trans.c", "error.c" and "alpm.h" for the current version of pacman.

Comment by pajaro (pajaro) - Sunday, 03 February 2008, 02:44 GMT
Of course, I had run some tests:
- attempting to run a second instance of "pacman -S" returns "another instance is running" message
- when pacman finishes running lock files and its symlink in pacman db get erased.
- if I kill "pacman -S" (this leaves lock files in place) and i run it again no message shows up (as expected, since pid stored in the lock is not running anymore).

Also, I realized that I misstyped "alpm.h.diff", and i wrote "almp.h.diff" instead.
Comment by Dan McGee (toofishes) - Sunday, 03 February 2008, 02:59 GMT
Can we get a single unified diff (diff -Naur) or a git patch, please? It is much easier to read than 4 separate patches not in unified format.
Comment by pajaro (pajaro) - Sunday, 03 February 2008, 03:53 GMT
I think this is what you need.
Comment by Xavier (shining) - Sunday, 03 February 2008, 11:35 GMT
You really have it working? It doesn't work here :)
First problem : existingpid is not null terminated, which makes the stat /proc/pid fail when it should succeed.
Second prob : when you reboot (or simply delete the file in /tmp/), then run pacman again, it doesn't delete the broken symlink : /var/lib/pacman/db.lck
And then when it tries to overwrite it with the correct symlink to the new pid file in /tmp/, it simply fails, and you are stuck with the broken symlink.

Finally, I am not sure what would happen if these functions were run as user (and not root). I didn't test that though, because pacman has a root check
that happens first (but there were discussions about removing that root check).
Comment by pajaro (pajaro) - Sunday, 03 February 2008, 12:14 GMT
Improved error handling.
Comment by Xavier (shining) - Sunday, 03 February 2008, 12:18 GMT
I am afraid you attached exactly the same file :)
Did you also take care of the 2 problems I mentioned?
Comment by pajaro (pajaro) - Sunday, 03 February 2008, 12:42 GMT
Ooops, i was aditing the wrong file, so the 2nd patch is the same that the 1st one.

Also changed lockfile mode from 0000 (noaccess) to 0444 (readonly).
Comment by pajaro (pajaro) - Sunday, 03 February 2008, 13:00 GMT
The reason for changing lockfile mode from noaccess to readonly is that if it has noaccess mode only root users can check if pacman db is locked.

Consistent readonly access to the db requires lock checking, because if the db is being modified while you query it, you will get unconsistent results.

If only root users can check lock status, root access is required for consistent db queries.

Comment by pajaro (pajaro) - Sunday, 03 February 2008, 13:09 GMT
Xavier, I didn't see your posts.
Comment by pajaro (pajaro) - Sunday, 03 February 2008, 13:53 GMT
Xavier, it really works here. None of the problems you describe appear to me, so i don't know if this last update solves your problems.

On the other hand, this would be compatible with removing root check, since write access would be required to create "/var/lib/pacman/db.lck" anyway.

Keep in mind that i didn't code in C since i was 16, so maybe there's a shortand for the for loop i added to fill existingpid with zerous

Thank you for your comments :)
Comment by pajaro (pajaro) - Sunday, 03 February 2008, 14:02 GMT
About running them as user:

- Users can't delete /var/lib/pacman/db.lck
- Users can't create /var/lib/pacman/db.lck
- Users can create /tmp/pacman-lock-[pid] files, but the pid will shouldn't get repeated before a reboot.

Also, since i don't know the max index of a pid, I made it get to a max pid of 999999. If this isn't big enough, i can increase it.

Comment by Xavier (shining) - Sunday, 03 February 2008, 14:12 GMT
That last version seems to work without editing, thanks :)
For initializing the array existingpid, you could use memset.
Or you could use dynamic arrays and calloc / free.

But only one zero is important, the one terminating the string, so thats what I did :
+ int n = read(fd, existingpid, sizeof(existingpid));
+ close(fd);
+ existingpid[n] = '\0';
Comment by pajaro (pajaro) - Sunday, 03 February 2008, 16:55 GMT
Updated patch with Xavier contribution :)
Comment by Xavier (shining) - Sunday, 03 February 2008, 17:30 GMT
Well, now you can remove the for loop filling with zeros :)
But no need to resend the patch for now. It still works like that.
Let's wait for other peoples' comments or suggestions.
Comment by pajaro (pajaro) - Sunday, 03 February 2008, 18:02 GMT
:P that's true
Comment by Loui Chang (louipc) - Tuesday, 22 July 2008, 23:02 GMT
Yeah!
Comment by Xavier (shining) - Wednesday, 23 July 2008, 06:05 GMT
I posted this patch on pacman-dev ML back then, and Nagy made 2 comments which I answered there :
http://www.archlinux.org/pipermail/pacman-dev/2008-February/011222.html

Anyway, this patch caused very little interest, so it didn't go far.
Comment by Allan McRae (Allan) - Wednesday, 23 July 2008, 06:58 GMT
I think this is an interesting feature and the patch looks quite reasonable. I guess it just got lost because it wasn't submitted as a git patch to the mailing list.
Comment by pajaro (pajaro) - Wednesday, 23 July 2008, 10:07 GMT
I cannot follow the pacman-dev mailing list, so i write here the answer to this:

> 2. Just for safety we could compare our current pid and the lock-pid on lockfile
> removal.
>

The problem comes when the db.lck didn't get removed (because of a power shutdown, a system failure, etc). When it gets succesfully removed the extra safety is not needed.
Comment by pajaro (pajaro) - Wednesday, 23 July 2008, 10:09 GMT
Also, in reply to that, this patch also allows processes that are not named pacman to lock/unlock the db safely
Comment by Allan McRae (Allan) - Monday, 21 December 2009, 10:42 GMT
Hmmm... it seems I completely forgot about this bug report and patch, but I did add the pid to the lock file:
http://projects.archlinux.org/users/allan/pacman.git/commit/?id=b55f4780

I am stuck trying to read that PID in and convert it to a pid_t to see if that process is running in a nice portable way. POSIX says pid_t is an int type so I suppose we can read it in to a char* and then convert that to a long then to a pid_t and we should be safe. Would I just use a "magic number" to define the array size needed to read this value in, or is there a better way?
Comment by Jonathan Lestrelin (Gentledevil) - Tuesday, 29 March 2011, 15:40 GMT
I've made two small patches to try to improve handling of db.lck using flock(2) advisory locks to know if the process that created the file is still alive.

The first patch just avoids the need to manually remove db.lck, while the second patch is an alternative which instead of exiting if an other process still hold a lock on db.lck wiat for him to finish.

I'm not very sure of my code as I'm not (yet) very experimented in writing C. Also note that this approach may not work if db.lck is stored on nfs (does anybody do this ?).

What do you think of it ?
Comment by Dan McGee (toofishes) - Tuesday, 29 March 2011, 16:26 GMT
Patch 2 is a definite no go- we can't have a blocking action like this in a library, and if I'm not mistaken you've introduced a no-idle, 100% CPU loop. This won't cut it.

Patch 1 has potential, although it doesn't update the behavior of a few scripts that would need to be adjusted to accommodate this new behavior. It also just doesn't feel quite right...I'm not sure why that is though. Is there some good discussion out there that explains O_EXCL vs. fctnl() locking vs. flock() behavior, and hopefully on more platforms than just Linux?

Another thing that occurred to me today is we can be much smarter with our locking in general; I've opened another feature request for this: FS#23501
Comment by Jonathan Lestrelin (Gentledevil) - Tuesday, 29 March 2011, 17:21 GMT
"Is there some good discussion out there that explains O_EXCL vs. fctnl() locking vs. flock() behavior, and hopefully on more platforms than just Linux?"
You can look at http://0pointer.de/blog/projects/locking and http://0pointer.de/blog/projects/locking2 but if you care of NFS and cross-platform it's going to be tricky...

"if I'm not mistaken you've introduced a no-idle, 100% CPU loop. This won't cut it."
D'oh !
Comment by Pablo Lezaeta (Jristz) - Friday, 14 April 2017, 03:53 GMT
is not easier to just "pidof pacman >> db.lck"

the bug just talk about of just add the pid of pacman to db.lck nothing more, no "pid plucking" nor "flock waiting" nor "to do fancie things".
For those other things like controling locking with that info a different bug should be opened.
Comment by Allan McRae (Allan) - Friday, 14 April 2017, 04:03 GMT
We already have FS#23501 which was opened six years ago (after the last last comment on this bug before today). So you contribution is a bit late...

Loading...