FS#20056 - [pacman] implement parallel transfers

Attached to Project: Pacman
Opened by Tomas Mudrunka (harvie) - Friday, 02 July 2010, 18:39 GMT
Last edited by Eli Schwartz (eschwartz) - Monday, 10 August 2020, 19:42 GMT
Task Type Feature Request
Category General
Status Closed
Assigned To Anatol Pomozov (anatolik)
Architecture All
Severity Medium
Priority Normal
Reported Version 3.3.3
Due in Version 6.0.0
Due Date Undecided
Percent Complete 100%
Votes 41
Private No

Details

Summary and Info:
I'd like to see parallel transfers implemented directly in pacman. We could download two or more packages at the same time, and each package could come from a different mirror. We could also -Sy the repo indexes from all repositories in parallel.

I know there are wrappers like powerpill, etc., but I don't like them much, and I think such a feature should be implemented directly in pacman. Pacman can use 3rd-party binaries ("XferCommand") to download packages and other files, so I believe such commands could be run in the background...

But pacman's clean output should also be preserved... e.g. apt-get can process several files in parallel, but it looks like it's vomiting 50 litres of some ugly sh*t into the terminal. There should be a nice progress bar showing overall progress without leaving a mess on the screen, or maybe multiple progress bars (one for each running process). Anyway, the clean look of pacman's output should definitely be preserved, but parallel transfers are a really useful feature...

peace
This task depends upon

Closed by  Eli Schwartz (eschwartz)
Monday, 10 August 2020, 19:42 GMT
Reason for closing:  Implemented
Additional comments about closing:  Implemented for the internal downloader in various commits including

557845bc971ff272c53da773baea277a2d2d47b8
0346e0eef224ab8ba22b659026ffdf2bfe95f3ae
16d98d657748fdbf32ab24db56d3cd4a23447673

Please open a new ticket for XferCommand support, if desired.
Comment by Dan McGee (toofishes) - Friday, 02 July 2010, 18:42 GMT
I've never understood the need for this type of thing- are there not mirrors that can keep up with your connection speed? This would not benefit me whatsoever as I always saturate my connection, so chances of me wanting to work on this are pretty close to zero.
Comment by Tomas Mudrunka (harvie) - Friday, 02 July 2010, 19:07 GMT
1) Your ISP may already have the file in its cache by the time you're able to receive it.
2) If you are waiting on an unresponsive repo (unofficial repos are often down, e.g. arch-games), you could download files from other repos (or wait for the timeout) in the meantime.
2.1) Sometimes a server just gets stuck and one of its processes becomes a bit unresponsive, while new connections work fine.
3) These days it's a bit of a shame to do FTP/HTTP transfers synchronously. Imagine server software that could only serve one client at a time and that client got stuck; that might have been acceptable in some early-ARPANET laboratory, but now? Oh, come on! This is similar: we should not wait for the slow ones.

Imagine a 100% parallelized future with big clusters of servers, multiple quantum processors and "distributed everything", forming one big swarm sharing all resources. Arch Linux is fast and almost ready for that future; pacman should be too :-)))
Comment by Allan McRae (Allan) - Thursday, 08 July 2010, 04:47 GMT
Despite the above explanation, which confused me more than anything... I can see some use for this, coming from a place that tends to be far away from various non-Arch repos.

However, I generally do other things while an update runs, so I would have no motivation to implement this either. Also, I think it would be hard to keep our nice output (I agree that apt's output in this respect is horrendous).

I think this is a "patches welcome" type request, as long as the number of parallel downloads was configurable and the output stayed sane.
Comment by Tomas Mudrunka (harvie) - Thursday, 24 March 2011, 15:25 GMT
I can imagine the output as:

file: [++++______+++++_____+++_______]
(the same file downloaded in parallel 3 times, maybe from different mirrors specified in pacman.conf)

core/file1, archaudio/file2, arch-games/file3:
[+++++_________][++++++++______][+++___________]
(three different files downloaded in parallel, in the optimal case each from a different repository)
Comment by Matt Peterson (ricochet1k) - Friday, 22 April 2011, 15:58 GMT
I know there are a few programs, such as aria2, that are designed to run many simultaneous downloads read from a file, and that already have some kind of nice output handling. Adding parallel download directly to pacman might be hard, but how hard would it be to add a ParallelXferCommand option to pacman.conf that would take an output directory and a file with a list of the files to download? That way at least it wouldn't take a pacman wrapper to do parallel downloads.
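(aria2 can already consume such a list through its -i flag, so a hypothetical invocation of that option might look like:

ParallelXferCommand = /usr/bin/aria2c -d %d -i %i

where ParallelXferCommand, %d and %i are all made up here: %d standing for the output directory and %i for the file listing the URLs to download.)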

In fact, I might even be willing to dig into pacman myself and add this functionality.
Comment by Pablo Lezaeta (Jristz) - Tuesday, 26 April 2016, 21:49 GMT
Question: with the current state of things, and without the hassle of a wrapper, is it possible to just add an XferCommand with the proper instructions to do a parallel download using aria2?
If that is possible with the current state of pacman, I don't see why an XferCommand example for a parallel download with aria2 can't be added instead of building a wrapper around pacman.
On the other side, the only thing I see missing is correct handling of .part files in pacman, to prevent race conditions and to make pacman wait until all download operations in XferCommand are finished.
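(For the single-file case, a pacman.conf line along these lines should already work today; this is only a sketch and the flag values are rough defaults, not recommendations:

XferCommand = /usr/bin/aria2c --allow-overwrite=true --continue=true --max-connection-per-server=4 --min-split-size=1M -d / -o %o %u

Note that this only splits each individual download across several connections; pacman still hands XferCommand one file at a time, so it does not fetch several packages in parallel.)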
Comment by mauro (lesto) - Sunday, 18 February 2018, 18:20 GMT
Looking into it right now; unfortunately the code is quite uncommented and navigating through the callbacks is a bit complex.

QUICK HACK SOLUTION - no compilation required:
- run pacman -Sup to get the list of all the packages that need to be downloaded
- download them (I'm thinking in parallel with torrent + webseed)
- run pacman -Syu
- your XferCommand will copy each requested package from your download directory to the path requested by pacman (downloading any missing package on the spot)

This works, but requires some hooks before and after pacman (to clean the cache), something a helper could do easily; see the sketch below.
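(A minimal sketch of that workflow, assuming the default cache location /var/cache/pacman/pkg; downloading straight into the cache even avoids the copy step, though it needs root and is not production-ready:

pacman -Sup > /tmp/pkg-urls.txt                       # URLs of everything that needs updating
aria2c -d /var/cache/pacman/pkg -i /tmp/pkg-urls.txt  # fetch them all in parallel
pacman -Syu                                           # packages are already in the cache
)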

BETTER SOLUTION:
The problem seems to be in download_with_xfercommand(): it is called for every single file and, in particular, after a download it calls rename() (line 277 of conf.c, commit a7dbe4635b4af9b5bd927ec0e7a211d1f6c1b0f2).
Now, it seems that if we don't add %o, it will not rename or check the file. That makes life harder for the script; I have no idea how to tell it the download path, but we could add a new parameter.

At this point it is our command in XferCommand that reports to a daemon, telling it each new file to download.

Finally, we also need to add a new command parameter, XferCommand_wait; if present, AFTER "downloading" all files, pacman will call this command and wait for it to finish (the classic 0 for no error, anything else for an error).

BEST SOLUTION:
If a special setting is created, like "multi_xfer", then in download_files() the "for(i = files; i; i = i->next) {...}" block is replaced by a call to the executable specified by multi_xfer. Its parameters are -c "cache_dir", -s "server1,server2,...", and -p "package1,package2"; the server parameter is the only optional one.
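(Purely to illustrate the proposed interface, an invocation would look something like:

/usr/bin/multi_xfer -c /var/cache/pacman/pkg -s "mirror1.example,mirror2.example" -p "foo-1.0-1-x86_64.pkg.tar.xz,bar-2.0-1-x86_64.pkg.tar.xz"

multi_xfer and all the values here are placeholders; nothing like this exists in pacman yet.)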

This seems the cleanest, but maybe I'm missing some side effect of the functions we are no longer calling (for example, firing ALPM_EVENT_PKGDOWNLOAD_START and the other per-file events); I guess this is not a big issue, though, as I THINK we can just print the output of the "multi_xfer" executable!

I'm going to wait for a response from someone expert on the code before starting the modification. The function call should be easy; what I still have to dig into is how to get multi_xfer from pacman.conf / the argument list.
Comment by mauro (lesto) - Sunday, 25 February 2018, 15:38 GMT
I implemented an initial patch (see attachment). It is very small and seems to work (make check fails some tests, but so does the master I branched from), but I have some questions for the code experts:

As the old implementation will not set the flag for parallel download (see point 3), the lib remains backward compatible.

1. I am adding a dependency on pthread; is that OK?

2. download_single_file() did NOT change, BUT I'm assuming EVENT(handle, &event); and MALLOC(payload->fileurl, len, RET_ERR(handle, ALPM_ERR_MEMORY, -1)); are thread-safe.
We could comment out EVENT when in parallel mode AND it turns out not to be thread-safe, but I have no idea about RET_ERR(handle, ALPM_ERR_MEMORY, -1).

3. I would like to add a flag in the conf file (and/or on the command line) to activate the parallel download. Can you please tell me how to do that? I would like not to break any compatibility.

Some implementation details:
download_files() has minimal modifications: if parallel download AND an external download command are set, then parallel_download_files() is called; otherwise serial_download_files(), which is exactly the same code as before.
THERE IS NO LIMIT ON THE NUMBER OF PARALLEL DOWNLOADS; this is meant to be used with some kind of daemon that will manage the downloads.
Comment by Dave Reisner (falconindy) - Sunday, 25 February 2018, 19:12 GMT
Thanks for the patch, but I can tell you from personal experience that using CURLM is a far better approach than trying to multiplex CURL handles on individual pthreads. Your approach also results in pacman-side callbacks that can't possibly be rationalized into meaningful progress bars for a human.

> I am adding dependency to pthread, is that ok?
No, it isn't. You'll likely need to deal with compatibility concerns (e.g. we have a large MinGW userbase). Moreover, you likely don't want it -- see above.

> this is meant to be used with some kind of daemon that will mangae the downloads.
Some kind of daemon? Please no... the intent is for pacman itself to manage everything.
Comment by mauro (lesto) - Sunday, 25 February 2018, 20:02 GMT
Hello, please note the current implementation is more a request for comments and does NOT work with the internal curl calls, only with an external program.

> CURLM is a far better approach

I did not know about it; I will keep it in mind if and when I extend the parallel functionality to pacman itself. I also had the idea of calling a specific script to "lock" until complete; that is not a big issue to implement (as soon as I understand how to get data from the command line/pacman.conf; a tip in that direction would be nice).

> Your approach also results in pacman-side callbacks that can't possibly be rationalized into meaningful progress bars

You are right, but I'm pretty sure that problem already exists with XferCommand. Pacman could just show how many processes are still waiting; it even knows the file size for each thread and so could estimate a "global" progress bar. After all, we don't really care about single files. As this is NOT enabled by default, I think it's not a big deal to change the view.
Also, when using the external lock script, that script could be the one displaying the information.

>No, it isn't. You'll likely need to deal with compatibility concerns (e.g. we have a large MinGW userbase).

This is quite surprising to me; are you aware of any alternative I could use? Otherwise I can look for one myself; a quick Google search shows MinGW has winpthreads.
Or we could make parallel support an optional build feature and disable it if pthread is not available.
Or I could use the idea with the "blocking" script; that would drop the pthread dependency completely, as the "go in background" behaviour would be managed by the external script.

> the intent is for pacman itself to manage everything.

This patch targets XferCommand, so there is no loss of any "management" pacman already has; it is just a different way of managing the external script.

------

Actually the implementation with the blocking script seems much easier and involves even less code, so I'll give it a quick try right now.
Comment by mauro (lesto) - Thursday, 01 March 2018, 19:11 GMT
Here I am again; the patch for parallel download using a script (proposal no. 2) is attached.
It took more time than I wanted, but a first implementation is there.

HOW IT WORKS:

In pacman.conf set XferCommand to use an external program (mandatory) AND XferLockCommand (also mandatory, otherwise the old behaviour is maintained).

For example (the code of the TESTING scripts is at the end; it is for testing only, so no judgment please; in particular the lock script just waits for ALL instances of curl to terminate):
XferCommand = /bin/backgroundDownload.sh %u %o
XferLockCommand = /bin/waitBackgroundDownload.sh

The new behaviour does NOT use threads but expects the XferCommand script to fork.
After all the download calls, XferLockCommand is called, and it is expected to terminate only when ALL downloads have completed.
WARNING: DIFFERENCE FROM CURRENT XferCommand BEHAVIOUR: the ".part" rename has been removed, as it was done for each file right after running XferCommand, and of course that now fails because the file is probably not there yet. If there is interest in pulling this patch, I will try to move the ".part" rename to a different spot in the code; it should not be too hard.

All the rest of the code uses the download function that calls the non-blocking XferCommand AND then the locking XferLockCommand, so those stay single-file downloads as they are now (for example the db update).
Basically the real multi-file parallel download only applies to downloading the packages.

cat backgroundDownload.sh
#!/bin/sh

echo "downloading $1"
curl "$1" -f -O -J &

cat waitBackgroundDownload.sh
#!/bin/sh

prog="curl"
echo "Waiting for all instances of $prog to complete"
ps -Af | grep "$prog" | grep -v grep | wc -l
while [ "$(ps -Af | grep "$prog" | grep -v grep | wc -l)" -ne 0 ]
do
    echo "still waiting"
    sleep 1
done

Comment by Julius Michaelis (caesar) - Tuesday, 13 March 2018, 14:54 GMT
mauro: Your current suggestion for waitBackgroundDownload.sh will wait for any curl, or in fact anything that has curl in the arguments, no? While that can certainly be fixed, I doubt any kind of locking solution implemented in shell scripts will ever be very reliable. Wouldn't it be easier to spawn one process and write all the desired packages to its stdin? Once it exits, all downloads are considered complete?
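(A minimal sketch of that PID-based fix, just to illustrate; the PID file path is made up, and it is still racy if a PID gets recycled, which rather supports the point that shell locking is fragile:

cat backgroundDownload.sh
#!/bin/sh
curl "$1" -f -O -J &
echo $! >> /tmp/pacman-dl.pids     # remember this transfer's PID

cat waitBackgroundDownload.sh
#!/bin/sh
# 'wait' only works on children of the same shell, so poll each recorded PID
while read -r pid; do
    while kill -0 "$pid" 2>/dev/null; do sleep 1; done
done < /tmp/pacman-dl.pids
rm -f /tmp/pacman-dl.pids
)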
toofishes: I want this because initiating an HTTP request has significant overhead through the proxy at my company.

My desired solution for this: spawn an external script, and write to it, tab-separated, the desired output path, the $repo, $arch, the package name, and the first 10 (? or all?) URLs to download from according to the mirrorlist. This script could be as simple as sed -nr 's/([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t(.*)/\5\n\tout=\1/p' | aria2c -i -.

Attached: My current workaround.
   wacman (0.7 KiB)
Comment by mauro (lesto) - Tuesday, 13 March 2018, 18:26 GMT
Hello Julius!

I think the issue with this method is that CLI parameters are limited to something like 256 characters per line (OK, maybe modern systems allow more), plus some other limits compiled into the kernel.
I won't develop any further patches until I get an official answer here or on the mailing list; see https://lists.archlinux.org/pipermail/pacman-dev/2018-March/022378.html

I also developed (uploaded as a gist yesterday!) a script very similar to yours, just a bit "smarter":
- uses checkupdates to get the list of packages to update and their versions
- gets the list of (uncommented) mirrors from the mirrorlist
- passes ALL the mirrors found to aria2c to download every single package (saving them into the pacman cache folder, hardcoded to /var/cache/pacman/pkg/; todo: read it from pacman.conf)
- every 10 instances of aria2c run, it waits for ALL of them to complete, then downloads the next 10 (todo: instead, always keep 10 instances running)
- runs pacman -Syu --noconfirm

You can find the script here: https://gist.github.com/MauroMombelli/04b24fa0644a4870869099276b86f2d4, once I fix some issues, like checking for privileges at startup, reading the cache path from the conf file, and making the 10 parallel downloads configurable (actually I'm thinking of decreasing that to 3 or 5, since we are already hitting multiple servers).

I also have a server that is currently generating .torrent files, with 10 random mirrors for each as "webseeds".
The idea is to optionally use those torrent files, but I still have to decide how to distribute them; I would like something with low overhead and possibly p2p too, to avoid killing the server's bandwidth :/
Comment by Eli Schwartz (eschwartz) - Tuesday, 13 March 2018, 19:13 GMT
If all you wanted was something that wrapped around pacman and aria2, https://xyne.archlinux.ca/projects/powerpill/ has existed for quite a long time.
Comment by mauro (lesto) - Tuesday, 13 March 2018, 19:54 GMT
The idea here is to build the foundation for making multiple servers and parallel downloads possible natively;
all the rest are experiments to see how the system works and where it can be improved.

Then there's the fact that powerpill lacks a repository and a dedicated place for listing issues, which suggests (maybe I'm wrong) that the developer is not open to collaboration.
Comment by Eli Schwartz (eschwartz) - Tuesday, 13 March 2018, 20:55 GMT
Using external wrappers is just as not-applicable as a daemon.

To provide XferCommand support for downloading a *single* file with aria2 using multiple http sources, you can just specify multiple Server urls directly to aria2. pacman needs to be taught how to recognize an external downloader that supports this, and how to add those multiple sources to the XferCommand command line.
To provide internal downloader or XferCommand support for downloading multiple files in parallel, pacman needs to know how to batch the same jobs it already does.
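(For illustration, aria2 already accepts several URIs for the same file on one command line and treats them as mirrors; the mirror hostnames below are placeholders:

aria2c -o foo-1.0-1-x86_64.pkg.tar.zst https://mirror1.example/core/os/x86_64/foo-1.0-1-x86_64.pkg.tar.zst https://mirror2.example/core/os/x86_64/foo-1.0-1-x86_64.pkg.tar.zst

so the missing piece is only pacman learning how to build such a command line.)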

There is no foundation needed. Neither of these requires server-side support; it would be handled entirely in libalpm.

(As for git repository hosting, https://bbs.archlinux.org/viewtopic.php?pid=954336#p954336 and https://bbs.archlinux.org/viewtopic.php?pid=1584207#p1584207 indicate you're not entirely wrong. Though there are support threads for issues.)
Comment by mauro (lesto) - Tuesday, 13 March 2018, 22:26 GMT
Eli, please note my patches ARE targeting libalpm; I have shown 2 possible implementations and a temporary workaround for those interested, and those interested could also be people developing other package managers.

>Using external wrappers is just as not-applicable as a daemon.

I disagree: the external program would simply signal the daemon, instructing it to download stuff or to return when the job is done.
In my specific case I was thinking that a pacman invocation would simply tell the "download manager", which continuously seeds the packages at very low speed, to start downloading the new packages (only those over a certain size) and to remove all bandwidth limits until the download is complete.

>To provide internal downloader or XferCommand support for downloading multiple files in parallel, pacman needs to know how to batch the same jobs it already does.

Yes, that is a solution, but for me the code is still too complex to know what goes where; as falconindy points out, there is even the callback system to take care of for the internal downloader.

For XferCommand, why should pacman know? It sends the requests to the script; whether the script handles them one by one or all together or whatever is the script's problem, as long as a locking mechanism is provided where necessary.
As I am new to this code, I am trying to implement a small modification that gives reasonable, non-hackish support, all without breaking the current behaviour.
Comment by Julius Michaelis (caesar) - Tuesday, 13 March 2018, 23:36 GMT
mauro: The limitation on the length of the command line is not a problem because you can pass download links to the external downloader via its stdin. That would also make your problem of waiting for blocks of 10 downloads go away.
Eli Schwartz: I'm well aware of powerpill, but I don't want any wrappers around pacman to begin with, and installing it from AUR was annoying. But yeah, as long as this is not some tiny clean patch, it is better for this to stay outside of pacman.

Is there a better place for discussing this with mauro? (I'm idling on IRC as JCaesar…)
Comment by Allan McRae (Allan) - Wednesday, 14 March 2018, 02:12 GMT
Repeating what I wrote on the mailing list:

The primary target for development in terms of download is the internal (libcurl) downloader. I will not consider patches for implementing multiple transfers via XferCommand until it is implemented in the internal downloader. The implementation for both will overlap somewhat and maintaining support for non-basic features of the external downloader will never take precedence over implementing new features for the internal downloader. So I prefer not to implement a feature that we may deliberately break and remove in the future.
Comment by mauro (lesto) - Wednesday, 14 March 2018, 18:06 GMT
@Allan that is fair, but that would also need a complete(?) rewrite of the output/UI code to handle parallel downloads, and probably of the event system, right?

This seems like a major modification; it would be nice to have some more input on how to implement it in a clean way.

@Julius: I try to keep the patch as small as possible, but the stdin approach would definitely be much cleaner, especially if there were a protocol to communicate the status of the downloads.

I'm lestofante on IRC (Freenode).
Comment by Allan McRae (Allan) - Tuesday, 19 November 2019, 04:49 GMT
So I don't have to look at this again... Attached is a basic example of moving up and down in a terminal while updating output. This could be used to update the progress of multiple downloads.
   test.c (0.2 KiB)
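(Not the attached test.c, just a quick shell sketch of the same idea, assuming an ANSI/VT100 terminal: print one line per download, then move the cursor back up and redraw the lines in place.

printf 'pkg1  [#####     ]\npkg2  [##        ]\n'
sleep 1
printf '\033[2A'        # move the cursor up 2 lines
printf 'pkg1  [########  ]\npkg2  [#####     ]\n'
)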
Comment by Anatol Pomozov (anatolik) - Thursday, 06 February 2020, 01:35 GMT
Hey folks. There is finally some progress with the parallel pacman downloader. You can find a WIP version of the feature at my GitHub branch https://github.com/anatol/pacman/tree/parallel-download. Please take a look. Any feedback or suggestions are very welcome.
Comment by Kevin Cox (kevincox) - Monday, 16 March 2020, 03:49 GMT
Unless I'm looking in the wrong place, it appears that this isn't moving. I propose the following to push this along.

1. Can we assign a dedicated reviewer to avoid the bystander effect? It seems like giving some feedback on anatolik's changes is required.
2. anatolik, I see that you called it a WIP. Can you list any known deficiencies that need to be addressed? Or is it ready for inclusion as far as you are aware?
Comment by Allan McRae (Allan) - Monday, 16 March 2020, 03:53 GMT
You are looking in the wrong place. We have mailing lists for patch review.
Comment by Anatol Pomozov (anatolik) - Monday, 20 July 2020, 16:38 GMT
The parallel download changes got merged into pacman's master branch. The feature will be available with the pacman 6.x release.
Comment by Eli Schwartz (eschwartz) - Tuesday, 21 July 2020, 16:19 GMT
  • Field changed: Percent Complete (0% → 100%)
Nice work!

Now that we have internal downloader support (our main target), if anyone wants to extend the XferCommand functionality to also have this capability, they are welcome to do so. For now, I think this is done...
