FS#9424 - db.lck storing pid
Attached to Project:
Pacman
Opened by pajaro (pajaro) - Friday, 01 February 2008, 03:40 GMT
Last edited by Allan McRae (Allan) - Saturday, 09 February 2013, 05:10 GMT
Details
Description:
Again, storing the pid in /var/lib/pacman/db.lck should be the way to go. Reasons why:
- Pacman would be able to remove the lock file automatically, without user interaction, if the process behind db.lck was dead, while keeping the database locked if it was still alive.
- The locking systems I have seen tend to store the pid.
- Apart from pacman, any other application would be able to handle the pid.
- Removing the lock file after pacman crashes takes a few seconds. After some months it can add up to a minute. Since we have thousands of Arch users, that gives us thousands of minutes that could have been saved. Thousands of minutes is enough hours to have some meals. Those meals equal at least one chicken. Save the chickens. Add the pid to db.lck.
Jokes apart, what's wrong with storing the pid in db.lck?
Also, I was wondering why good locking systems create symlinks to temporary files. I found why: the pid that pacman had when the db.lck file was created can be assigned to another process after a reboot, so the workaround is to create db.lck in /tmp and symlink it into /var/lib/pacman. This way non-root users still can't lock the pacman db, but after a reboot the lock file is gone.
So my question, again, is... why not store the pacman pid (process id) in db.lck (/var/lib/pacman/db.lck)?
Now the function _alpm_lckmk can return two error values, -1 and -2.
-1: "unable to lock database": there was an error creating the lock file. It is still associated with the error message PM_ERR_HANDLE_LOCK, so localizations have to update it and drop the "if you are sure pacman is not running..." tail, since it is not needed anymore.
-2: "another instance is running". It has a new error message, PM_ERR_INSTANCE_RUNNING.
In the attached tar.gz there are the diffs for "util.c", "trans.c", "error.c" and "alpm.h" for the current version of pacman.
- attempting to run a second instance of "pacman -S" returns the "another instance is running" message
- when pacman finishes running, the lock file and its symlink in the pacman db directory get erased
- if I kill "pacman -S" (which leaves the lock files in place) and run it again, no message shows up (as expected, since the pid stored in the lock is no longer running)
Also, I realized that I mistyped "alpm.h.diff" and wrote "almp.h.diff" instead.
First problem: existingpid is not null-terminated, which makes the stat of /proc/<pid> fail when it should succeed.
Second problem: when you reboot (or simply delete the file in /tmp/) and then run pacman again, it doesn't delete the broken symlink /var/lib/pacman/db.lck.
Then, when it tries to overwrite it with the correct symlink to the new pid file in /tmp/, it simply fails, and you are stuck with the broken symlink.
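One way to deal with the dangling symlink, sketched here as a suggestion (make_lock_link is a hypothetical helper, not something from the patch): detect a link whose target is gone and remove it before recreating it.

#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create lockpath as a symlink to pidfile. If a leftover link points at a
 * pid file that no longer exists (e.g. /tmp was cleared by a reboot),
 * remove the dangling link first so symlink() can succeed. */
static int make_lock_link(const char *lockpath, const char *pidfile)
{
    struct stat st;
    if(lstat(lockpath, &st) == 0 && stat(lockpath, &st) != 0 && errno == ENOENT) {
        /* the link itself exists but its target does not: it is stale */
        unlink(lockpath);
    }
    return symlink(pidfile, lockpath);
}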
Finally, I am not sure what would happen if these functions were run as a regular user (and not root). I didn't test that, though, because pacman has a root check
that happens first (but there have been discussions about removing that root check).
Did you also take care of the 2 problems I mentioned?
Also changed the lock file mode from 0000 (no access) to 0444 (read-only).
Consistent read-only access to the db requires lock checking, because if the db is being modified while you query it, you will get inconsistent results.
If only root can check the lock status, root access is required for consistent db queries.
On the other hand, this would be compatible with removing the root check, since write access would be required to create "/var/lib/pacman/db.lck" anyway.
Keep in mind that I haven't coded in C since I was 16, so maybe there's a shorthand for the for loop I added to fill existingpid with zeros.
Thank you for your comments :)
- Users can't delete /var/lib/pacman/db.lck
- Users can't create /var/lib/pacman/db.lck
- Users can create /tmp/pacman-lock-[pid] files, but the pid shouldn't get repeated before a reboot.
Also, since I don't know the maximum value of a pid, I capped it at 999999. If this isn't big enough, I can increase it.
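For what it's worth, on Linux the real upper bound can be read from /proc/sys/kernel/pid_max instead of guessing; a small illustrative sketch (not part of the patch):

#include <stdio.h>

/* Read the system's maximum pid on Linux, falling back to the
 * traditional default if the proc file cannot be read. */
static long read_pid_max(void)
{
    long pid_max = 32768; /* historical Linux default */
    FILE *f = fopen("/proc/sys/kernel/pid_max", "r");
    if(f) {
        if(fscanf(f, "%ld", &pid_max) != 1) {
            pid_max = 32768;
        }
        fclose(f);
    }
    return pid_max;
}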
For initializing the array existingpid, you could use memset.
Or you could use dynamic arrays and calloc / free.
But only one zero is important, the one terminating the string, so that's what I did:
+ ssize_t n = read(fd, existingpid, sizeof(existingpid) - 1);
+ close(fd);
+ if(n < 0) n = 0; /* treat a failed read as an empty pid string */
+ existingpid[n] = '\0';
But no need to resend the patch for now. It still works like that.
Let's wait for other people's comments or suggestions.
http://www.archlinux.org/pipermail/pacman-dev/2008-February/011222.html
Anyway, this patch attracted very little interest, so it didn't go far.
> 2. Just for safety we could compare our current pid and the lock-pid on lockfile
> removal.
>
The problem arises when db.lck didn't get removed (because of a power failure, a system crash, etc.). When it gets successfully removed, the extra safety is not needed.
http://projects.archlinux.org/users/allan/pacman.git/commit/?id=b55f4780
I am stuck trying to read that PID in and convert it to a pid_t to see whether that process is running, in a nice portable way. POSIX says pid_t is a signed integer type, so I suppose we can read it into a char buffer, convert that to a long and then to a pid_t, and we should be safe. Would I just use a "magic number" to define the array size needed to read this value in, or is there a better way?
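One possible answer, sketched here as a suggestion rather than the final code (the buffer size and the helper name lockfile_pid_running are my own choices): read the file into a small fixed text buffer, convert with strtol, and probe the process with kill(pid, 0).

#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Read a decimal pid from a lock file and test whether that process still
 * exists. Returns 1 if it is running, 0 if not, -1 on error. */
static int lockfile_pid_running(const char *lockfile)
{
    /* 32 bytes comfortably holds the decimal form of any pid_t plus '\0' */
    char buf[32] = {0};
    int fd = open(lockfile, O_RDONLY);
    if(fd < 0) {
        return -1;
    }
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    close(fd);
    if(n <= 0) {
        return -1;
    }
    buf[n] = '\0';

    long value = strtol(buf, NULL, 10);
    if(value <= 0) {
        return -1;
    }
    pid_t pid = (pid_t)value;

    /* kill() with signal 0 only checks for existence and permission:
     * ESRCH means there is no such process */
    if(kill(pid, 0) == 0 || errno == EPERM) {
        return 1;
    }
    return 0;
}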
The first patch just avoids the need to manually remove db.lck, while the second patch is an alternative that, instead of exiting if another process still holds a lock on db.lck, waits for it to finish.
I'm not very sure of my code, as I'm not (yet) very experienced in writing C. Also note that this approach may not work if db.lck is stored on NFS (does anybody do this?).
What do you think of it ?
pidfile_flock_wait.diff (1 KiB)
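For readers who don't want to open the diff, the general idea of a blocking flock() wait looks roughly like the sketch below; this is my own illustration of the approach, not the attached patch.

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Open (or create) the lock file and block until an exclusive flock()
 * lock is obtained, i.e. wait for the other instance instead of exiting.
 * The kernel drops the lock automatically if the holder dies, so no
 * stale-lock cleanup is needed. Note: flock() is BSD/Linux, not POSIX,
 * and traditionally does not lock across NFS. */
static int acquire_lock_blocking(const char *lockfile)
{
    int fd = open(lockfile, O_RDWR | O_CREAT, 0644);
    if(fd < 0) {
        return -1;
    }
    if(flock(fd, LOCK_EX) != 0) { /* blocks in the kernel, no busy-waiting */
        close(fd);
        return -1;
    }
    return fd; /* keep the fd open; closing it releases the lock */
}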
Patch 1 has potential, although a few scripts would need to be adjusted to accommodate the new behavior. It also just doesn't feel quite right... I'm not sure why that is, though. Is there some good discussion out there that explains O_EXCL vs. fcntl() locking vs. flock() behavior, hopefully on more platforms than just Linux?
Another thing that occurred to me today is we can be much smarter with our locking in general; I've opened another feature request for this: FS#23501
You can look at http://0pointer.de/blog/projects/locking and http://0pointer.de/blog/projects/locking2 but if you care about NFS and cross-platform behavior it's going to be tricky...
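As a rough, non-authoritative summary of the three mechanisms being compared (the articles linked above cover the NFS caveats in detail; error handling omitted):

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Rough comparison of the three mechanisms, each applied to a
 * hypothetical lock file path. */
static void locking_comparison(const char *path)
{
    /* 1. open() with O_EXCL: atomic create-if-absent, the approach db.lck
     *    creation has relied on. Portable, but a crash leaves a stale file
     *    behind, and it was historically unreliable on NFSv2. */
    int fd_excl = open(path, O_WRONLY | O_CREAT | O_EXCL, 0444);

    /* 2. fcntl() POSIX record locks: released automatically when the process
     *    exits and supported over NFS (via lockd), but dropped as soon as any
     *    descriptor for the file is closed, which is easy to trigger by accident. */
    int fd_fcntl = open(path, O_RDWR | O_CREAT, 0644);
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    if(fd_fcntl >= 0) {
        fcntl(fd_fcntl, F_SETLK, &fl);
    }

    /* 3. flock() BSD locks: tied to the open file description, also released
     *    on exit, simpler semantics, but not in POSIX and traditionally a
     *    no-op over NFS. */
    int fd_flock = open(path, O_RDWR | O_CREAT, 0644);
    if(fd_flock >= 0) {
        flock(fd_flock, LOCK_EX | LOCK_NB);
    }

    if(fd_excl >= 0) close(fd_excl);
    if(fd_fcntl >= 0) close(fd_fcntl);
    if(fd_flock >= 0) close(fd_flock);
}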
"if I'm not mistaken you've introduced a no-idle, 100% CPU loop. This won't cut it."
D'oh !
This bug is just about adding the pid of pacman to db.lck, nothing more: no "pid plucking", no "flock waiting", no fancy things.
For those other things, like controlling locking with that info, a different bug should be opened.