FS#20056 - [pacman] implement parallel transfers
Attached to Project:
Pacman
Opened by Tomas Mudrunka (harvie) - Friday, 02 July 2010, 18:39 GMT
Last edited by Eli Schwartz (eschwartz) - Monday, 10 August 2020, 19:42 GMT
Details
Summary and Info:
I'd like to see parallel transfers implemented directly in pacman. We could download two or more packages at the same time, download each package from a different mirror, and also -Sy the repo indexes from all repositories in parallel. I know there are wrappers like powerpill, etc., but I don't like them much and I think such a feature should be implemented directly in pacman. Pacman can use 3rd-party binaries ("XferCommand") to download packages and other files, so I believe such commands could be run in the background, but the clean output of pacman should also be preserved. E.g. apt-get is able to process several files in parallel, but it looks like vomiting 50 litres of some ugly sh*t into the terminal. There should be a nice progress bar showing overall progress without leaving a mess on screen, or maybe multiple progress bars (one for each running process). Anyway, the clean look of pacman's output should definitely be preserved, but parallel transfers are a really useful feature... peace
This task depends upon
Closed by Eli Schwartz (eschwartz)
Monday, 10 August 2020, 19:42 GMT
Reason for closing: Implemented
Additional comments about closing: Implemented for the internal downloader in various commits including
557845bc971ff272c53da773baea277a2d2d47b8
0346e0eef224ab8ba22b659026ffdf2bfe95f3ae
16d98d657748fdbf32ab24db56d3cd4a23447673
Please open a new ticket for XferCommand support, if desired.
2.) If you are waiting on an unresponsive repo (unofficial repos are often down, e.g. arch-games, etc.), you can download files from the other repos (or wait for the timeout) in the meantime.
2.1) Sometimes a server just gets stuck and one of its processes becomes a bit unresponsive, while new connections work fine.
3.) Today it's a bit of a shame to do FTP/HTTP transfers synchronously. Imagine some naive server software that can only serve one client at a time, and one client gets stuck; that might have been acceptable in an early-ARPANET laboratory, but now? Oh, come on! This is similar: we should not wait for the slow ones.
Imagine a 100% parallelized future with big clusters of servers with multiple quantum processors and "distributed everything", forming one big swarm sharing all resources. Arch Linux is fast and almost ready for the future; pacman should be too :-)))
However, I generally do other things while an update runs, so I would have no motivation to implement this either. Also, I think it would be hard to keep our nice output (I agree that apt's output in this respect is horrendous).
I think this is a "patches welcome" type request, as long as the number of parallel downloads was configurable and the output stayed sane.
file: [++++______+++++_____+++_______]
(the same file downloaded in parallel 3 times - maybe from different mirrors specified in pacman.conf)
core/file1, archaudio/file2, arch-games/file3:
[+++++_________][++++++++______][+++___________]
(three different files downloaded in parallel - in the optimal case each from a different repository)
In fact, I might even be willing to dig into pacman myself and add this functionality.
If that is possible with the current state of pacman, I don't see why an XferCommand example for parallel download with aria2 can't be added, instead of building a wrapper into pacman.
On the other side, the only thing I see missing is correct handling of .part files in pacman, to prevent race conditions and to make pacman wait until all download operations in the XferCommand have finished.
QUICK HACK SOLUTION - no compilation required:
- run pacman -Sup to get the list of all the packages that need to be downloaded
- download them (I'm thinking of doing it in parallel with torrent + webseed)
- run pacman -Syu
- your XferCommand will copy the requested package from your download directory to the path requested by pacman (and download any missing package on the spot) - see the sketch below
This works, but requires some hooks before and after pacman (to clean the cache), something a helper could do easily.
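For illustration only, a minimal sketch of such an XferCommand script, assuming the packages were pre-downloaded into a staging directory (the script name, the /var/tmp/predownload path and the curl fallback are my own placeholders, not part of the original suggestion):
#!/bin/sh
# hypothetical cachefetch.sh, configured as: XferCommand = /usr/local/bin/cachefetch.sh %u %o
predl=/var/tmp/predownload          # filled beforehand, e.g. via torrent + webseed
url="$1"                            # %u: the URL pacman asked for
out="$2"                            # %o: the output path pacman expects
file=$(basename "$url")
if [ -f "$predl/$file" ]; then
    cp "$predl/$file" "$out"        # serve the pre-downloaded package
else
    curl -f -L -o "$out" "$url"     # download on the spot if it is missing
fi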
BETTER SOLUTION:
The problem seems to be in download_with_xfercommand(): it is called for every single file and, in particular, after a download it calls rename() (line 277 of conf.c, commit a7dbe4635b4af9b5bd927ec0e7a211d1f6c1b0f2).
Now, it seems that if we don't add %o, it will not rename or check the file. That means a harder life for the script - I have no idea how to tell it the download path, but we could add a new parameter.
At this point it is our command in XferCommand that will report to a daemon which new file to download.
Finally, we also need to add a new parameter, Xfercommand_wait; if present, AFTER "downloading" all files, pacman will call this command and wait for it to finish (the classic 0 for no error, anything else for an error).
BEST SOLUTION:
If a special setting is created, like "multi_xfer", then in download_files() the "for(i = files; i; i = i->next) {...}" block is replaced by a call to the executable specified by multi_xfer; the parameters are -c "cache_dir", -s "server1,server2,...", and -p "package1,package2". The server parameter is the only optional one.
This seems the cleanest, but maybe I'm missing some side effect of the functions we are not calling (for example, firing ALPM_EVENT_PKGDOWNLOAD_START and other per-file events, but I guess this is not a big issue, as I THINK we can just print the output of the "multi_xfer" executable). See the example invocation below.
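For illustration, the call that would replace the download loop might look roughly like this (the executable path, mirror URLs and package file names are made up; only the -c/-s/-p options come from the proposal above):
/usr/local/bin/multi_xfer \
    -c "/var/cache/pacman/pkg" \
    -s "https://mirror1.example.org/core/os/x86_64,https://mirror2.example.org/core/os/x86_64" \
    -p "package1-1.0-1-x86_64.pkg.tar.xz,package2-2.1-1-x86_64.pkg.tar.xz"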
I'm going to wait for a response from someone expert on the code before starting the modification. The function call should be easy; what I still have to dig into is how to get multi_xfer from pacman.conf / the argument list.
As the old implementation will not set the flag for parallel download (see point 3), the lib remains backward compatible.
1. I am adding a dependency on pthread, is that OK?
2. download_single_file() did NOT change, BUT I'm assuming EVENT(handle, &event); and MALLOC(payload->fileurl, len, RET_ERR(handle, ALPM_ERR_MEMORY, -1)); are thread safe.
We could comment out EVENT when in parallel mode AND not thread safe, but I have no idea about RET_ERR(handle, ALPM_ERR_MEMORY, -1).
3. I would like to add a flag in the conf file (and/or on the command line) to activate the parallel download. Can you please tell me how to do it? I would like not to break any compatibility.
Some implementation details:
download_files() has a minimal modification; if parallel download AND an external download command are set, then parallel_download_files() is called, otherwise serial_download_files(), which is exactly the same code as before.
THERE IS NO LIMIT ON THE PARALLEL DOWNLOADS; this is meant to be used with some kind of daemon that will manage the downloads.
> I am adding dependency to pthread, is that ok?
No, it isn't. You'll likely need to deal with compatibility concerns (e.g. we have a large MinGW userbase). Moreover, you likely don't want it -- see above.
> this is meant to be used with some kind of daemon that will manage the downloads.
Some kind of daemon? Please no... the intent is for pacman itself to manage everything.
> CURLM is a far better approach
I did not know about it; I will keep it in mind when and if I extend the parallel functionality to pacman itself. I also had the idea to call a specific script to "lock" until everything is complete, which is not a big issue to implement (as soon as I understand how to get data from the command line / pacman.conf - a tip in that direction would be nice).
> Your approach also results in pacman-side callbacks that can't possibly be rationalized into meaningful progress bars
You are right, but I'm pretty sure that problem is already there with XferCommand. Pacman could just show how many processes are still waiting; it even knows the size of the file for each thread and so could estimate a "global" progress bar. After all, we don't really care about single files. As this is NOT enabled by default, I think it is not a big deal to change the view.
Also, when using the external lock script, that script could be the one displaying the information.
>No, it isn't. You'll likely need to deal with compatibility concerns (e.g. we have a large MinGW userbase).
This is quite surprising to me - are you aware of any alternative I could use? Otherwise I can look for one; a quick Google search shows MinGW has winpthreads.
Or we can make parallel support an optional build feature and disable it if pthread is not available.
Or I can use the idea with the "blocking" script, which would drop the pthread dependency completely, as the "go in background" behaviour would be managed by the external script being called.
> the intent is for pacman itself to manage everything.
This patch targets XferCommand, so there is no loss of the "management" pacman already has; it is just a different way to manage the external script.
------
Actually the implementation with the blocking script seems much easier and even less code, so I'll give it a quick try right now.
It took more time than I wanted, but a first implementation is there.
HOW IT WORKS:
In pacman.conf, set XferCommand to use an external program (mandatory) AND XferLockCommand (mandatory too, otherwise the old behaviour is maintained).
For example (code of the scripts for TESTING at the end - it is for testing, so no judgment please; in particular the lock script just waits for ALL instances of curl to terminate):
XferCommand = /bin/backgroundDownload.sh %u %o
XferLockCommand = /bin/waitBackgroundDownload.sh
The new behaviour does NOT use threads but expects the XferCommand script to fork.
After all the download calls, XferLockCommand is invoked, and it is expected to terminate only when ALL downloads have completed.
WARNING - DIFFERENCE FROM THE CURRENT XferCommand BEHAVIOUR: the ".part" rename has been removed, since it used to be done for each file right after its download command, and of course now it was failing because the file is probably not there yet. If there is interest in pulling this patch, I will try to move the ".part" rename to a different spot in the code; it should not be too hard.
All other code paths use the download function that calls the non-blocking XferCommand AND then the locking XferLockCommand, which keeps them single-file downloads as before (for example the db update).
Basically, the real multi-file parallel download is only used when downloading the packages.
cat backgroundDownload.sh
#!/bin/sh
echo "downloading $1"
curl "$1" -f -O -J &
cat waitBackgroundDownload.sh
#!/bin/sh
prog="curl"
echo "Waiting for all instances of $prog to complete"
# print the current number of running instances (debug output)
ps -Af | grep $prog | grep -v grep | wc -l
while [ `ps -Af | grep $prog | grep -v grep | wc -l` -ne 0 ]
do
    echo "still waiting"
    sleep 1
done
toofishes: I want this because initiating an HTTP request has significant overhead through the proxy at my company.
My desired solution for this: spawn an external script and write to it, tab-separated, the desired output path, the $repo, $arch, the package name, and the first 10 (? or all?) URLs to download from according to the mirrorlist. This script could be as simple as sed -nr 's/([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t(.*)/\5\n\tout=\1/p' | aria2c -i -.
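For illustration (package name, paths and mirror URL are made up, and <TAB> stands for a literal tab character), a single input line such as
/var/cache/pacman/pkg/foo-1.0-1-x86_64.pkg.tar.xz<TAB>core<TAB>x86_64<TAB>foo<TAB>https://mirror.example.org/core/os/x86_64/foo-1.0-1-x86_64.pkg.tar.xz
would be turned by that sed command into
https://mirror.example.org/core/os/x86_64/foo-1.0-1-x86_64.pkg.tar.xz
    out=/var/cache/pacman/pkg/foo-1.0-1-x86_64.pkg.tar.xz
which is the input-file format aria2c -i - expects (a URI line followed by whitespace-indented per-download options). If the last field contains several tab-separated URLs, aria2 treats them as alternative sources for the same file.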
Attached: My current workaround.
I think the issue with this method is that CLI parameters are limited to 256 characters per line (OK, maybe modern systems allow more), plus some other limits compiled into the kernel.
I won't develop any further patch until I get an official answer here or on the mailing list, see https://lists.archlinux.org/pipermail/pacman-dev/2018-March/022378.html
I also developed (uploaded as a gist yesterday!) a script very similar to yours, just a bit "smarter":
- uses checkupdates to get the list of packages to update and their versions
- gets the list of (uncommented) mirrors from the mirrorlist
- passes ALL the mirrors found to aria2c to download every single package (saving them into the pacman cache folder, hardcoded to /var/cache/pacman/pkg/; todo: read it from pacman.conf)
- after every 10 instances of aria2c started, it waits for ALL of them to complete, then downloads the next 10 (todo: instead, always keep 10 instances running)
- runs pacman -Syu --noconfirm
You can find the script here: https://gist.github.com/MauroMombelli/04b24fa0644a4870869099276b86f2d4 - once I fix some issues, like checking for privileges at startup, reading the cache path from the conf file, and the 10 parallel file downloads (actually I'm thinking of decreasing that to 3 or 5, since we are already calling multiple servers). A minimal sketch of the workflow is shown below.
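For reference, a minimal sketch of that workflow (NOT the linked gist; the use of expac to resolve repo/arch/version, the .pkg.tar.xz extension and downloading one package at a time are my own simplifications):
#!/bin/sh
cache=/var/cache/pacman/pkg                        # todo in the gist: read from pacman.conf
# uncommented mirrors, still containing the $repo/$arch placeholders
mirrors=$(sed -n 's/^Server *= *//p' /etc/pacman.d/mirrorlist)
for pkg in $(checkupdates | awk '{print $1}'); do
    # resolve repository, architecture and new version of the package
    set -- $(expac -S '%r %a %v' "$pkg" | head -n1)
    [ -n "$3" ] || continue                        # skip if not found in a sync repo
    repo=$1; arch=$2; ver=$3
    file="$pkg-$ver-$arch.pkg.tar.xz"
    # expand the placeholders and append the file name to every mirror
    urls=$(printf '%s\n' $mirrors | sed "s|[$]repo|$repo|; s|[$]arch|$arch|; s|\$|/$file|")
    # aria2c treats the multiple URIs as alternative sources for one file
    aria2c -d "$cache" $urls
done
pacman -Syu --noconfirm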
I also have a server that is currently generating .torrent files, with 10 random mirrors for each as "webseeds".
The idea is to optionally use those torrent files, but I still have to decide how to distribute them; I would like something with low overhead and possibly p2p too, to avoid killing the bandwidth of the server :/
All the rest are experiments to see how the system works and where it can be improved.
Then we can talk about how powerpill lacks a repository and a dedicated place for listing issues, which suggests (maybe I'm wrong) that the developer is not open to collaboration.
To provide XferCommand support for downloading a *single* file with aria2 using multiple http sources, you can just specify multiple Server urls directly to aria2. pacman needs to be taught how to recognize an external downloader that supports this, and how to add those multiple sources to the XferCommand command line.
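For example (package file name and mirror URLs made up), such a single-file, multi-source invocation of aria2 could look like:
aria2c -o linux-5.8.arch1-1-x86_64.pkg.tar.zst \
    https://mirror1.example.org/core/os/x86_64/linux-5.8.arch1-1-x86_64.pkg.tar.zst \
    https://mirror2.example.org/core/os/x86_64/linux-5.8.arch1-1-x86_64.pkg.tar.zst
By default aria2 treats multiple HTTP URIs given on the command line as sources for the same file (unless -Z is used), so this is purely a matter of how pacman builds the XferCommand command line.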
To provide internal downloader or XferCommand support for downloading multiple files in parallel, pacman needs to know how to batch the same jobs it already does.
There is no foundation needed. Neither of these are something that requires server-side support, it would be handled entirely in libalpm.
(As for git repository hosting, https://bbs.archlinux.org/viewtopic.php?pid=954336#p954336 and https://bbs.archlinux.org/viewtopic.php?pid=1584207#p1584207 indicate you're not entirely wrong. Though there are support threads for issues.)
>Using external wrappers is just as not-applicable as a daemon.
I disagree; the external program would simply send a signal to the daemon, instructing it to download stuff or to return when the job is done.
In my specific case I was thinking that a pacman invocation would simply tell the "download manager", which continuously seeds the packages at very low speed, to start downloading the new packages (only those over a certain size) and to remove all bandwidth limits until the download is complete.
>To provide internal downloader or XferCommand support for downloading multiple files in parallel, pacman needs to know how to batch the same jobs it already does.
Yes, that is a solution, but for me the code is still too complex to know what goes where; as falconindy points out, there is even the callback system to be taken care of for the internal downloader.
For Xfer, why should pacman know? It sends the requests to the script; whether the script handles them one by one, all together, or however else, is the script's problem, as long as a locking mechanism is provided where necessary.
As I am new to this code, I am trying to implement a small modification that gives reasonable and non-hackish support, all without breaking the current behaviour.
Eli Schwartz: I'm well aware of powerpill, but I don't want any wrappers around pacman to begin with, and installing it from AUR was annoying. But yeah, as long as this is not some tiny clean patch, it is better for this to stay outside of pacman.
Is there a better place for discussing this with mauro? (I'm idling on IRC as JCaesar…)
The primary target for development in terms of download is the internal (libcurl) downloader. I will not consider patches for implementing multiple transfers via XferCommand until it is implemented in the internal downloader. The implementation for both will overlap somewhat and maintaining support for non-basic features of the external downloader will never take precedence over implementing new features for the internal downloader. So I prefer not to implement a feature that we may deliberately break and remove in the future.
This seems like a major modification; it would be nice to have some more input on how to implement this in a clean way.
@Julius: I try to keep the patch as small as possible, but definitely stdio would be much cleaner, especially if there were a protocol to communicate the status of the download.
I'm lestofante on IRC (Freenode).
1. Can we assign a dedicated reviewer to avoid the bystander effect? It seems like giving some feedback on anatolik's changes is required.
2. anatolik, I see that you called it a WIP. Can you list any known deficiencies that need to be addressed? Or is it ready for inclusion as far as you are aware?
Now that we have internal downloader support (our main target), if anyone wants to extend the XferCommand functionality to also have this capability, they are welcome to do so. For now, I think this is done...