FS#15043 - Need better parsing of PKGBUILDs

Attached to Project: AUR web interface
Opened by Tomas Mudrunka (harvie) - Wednesday, 10 June 2009, 13:26 GMT
Last edited by Lukas Fleischer (lfleischer) - Friday, 03 May 2013, 08:55 GMT
Task Type Feature Request
Category PKGBUILD Parser
Status Closed
Assigned To Lukas Fleischer (lfleischer)
Architecture All
Severity High
Priority Normal
Reported Version 1.5.6.2
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 25
Private No

Details

== 1.) It would be nice to see all common arrays from PKGBUILD in AUR. especially optdepends should be displayed next to the regular dependencies.


== 2.) Source array is not parsed in some cases. i don't know why... compare this:
http://aur.archlinux.org/packages.php?ID=25384 (sources are listed)
http://aur.archlinux.org/packages.php?ID=24117 (sources are not listed)

so there is some obvious bug... BASH (makepkg) is able to parse those PKGBUILDs without problem. You can use some bash script to parse PKGBUILDs for PHP when new PKGBUILDs are uploaded.


== 3.) arch and license arrays can be converted to links like
http://wiki.archlinux.org/index.php/AUR-Licences#$LICENSE
http://wiki.archlinux.org/index.php/AUR-Archs#$ARCH

so we can have nice connection between AUR and Wiki...


== 4.) add some "long description" to the PKGBUILDs that will be used to display more information (longer than that 80 letters) about packages on WEB ONLY?
This task depends upon

Closed by  Lukas Fleischer (lfleischer)
Friday, 03 May 2013, 08:55 GMT
Reason for closing:  Won't implement
Additional comments about closing:  Please use .AURINFO if the PKGBUILD parser doesn't work for you.
Comment by Tomas Mudrunka (harvie) - Wednesday, 10 June 2009, 22:32 GMT
2.) i think that AUR website have problems with parsing PKGBUILD arrays that are including comments. please take look at this (in most cases there is no real need for using comments, but sometimes it's usefull to make small comment)
Comment by Gergely (imrehg) - Saturday, 08 August 2009, 03:15 GMT
2.) I think that has been fixed by now. When I upload the "broken" package to a test-AUR on my computer, it is parsed correctly. Do you still have problem packages? To check if it works well now please upload the package again so it will be parsed once more.
Comment by Tomas Mudrunka (harvie) - Thursday, 13 August 2009, 16:48 GMT
2.) yea, AUR works great with comments in Arrays now.
But there are still the other 3 subissues...
Comment by Gergely (imrehg) - Friday, 14 August 2009, 00:23 GMT
1) Yeah, probably it would be nice, but you can already read those things just by clicking on the PKGBUILD link, and that does not require to do any fancy parsing on AUR's part. Having said that, the optdepends part would indeed be nice. There's some issue that the optdepeneds don't seem to have a standard format, but with a bit of heuristics that can be handled. (E.g. extra/wine only list the dependencies, extra/transmission-gtk has comments for each of them in the form of "optional: the function it would add". I can't remember, but I think I've seen packages where all the optdepends are in one line, not in an array.) But as I said, this is not too bad, I started to check out how to handle it. Other fields, though, I don't really see the point at the moment.

2) I'm glad it's fine now.

3) Started to check that one out too, hopefully will have some time over the weekend.

4) I don't think there should be any AUR specific field in PKGBUILDs. If the pacman maintainers implement it then AUR should definitely support it, but not before that. Not sure how others feel about this....
Comment by Gavin Bisesi (Daenyth) - Tuesday, 08 September 2009, 22:51 GMT
WRT optdepends, I'm pretty sure the 'official' way is "optional: enables foo support" in an array. Anything else should probably be regarded as an error
Comment by Tomas Mudrunka (harvie) - Wednesday, 09 September 2009, 13:40 GMT
Daenyth: yes, but there can be links to those optdepend packages
<a href="http://aur.archlinux.org/packages.php?O=0&K=foo&do_Search=Go">foo</a>: enables foo support

today you can't see optdepends in aur at all...
Comment by Dino Krtanjek (KRTac) - Thursday, 01 October 2009, 23:40 GMT
Hi.

3) Why not link to the package directly insted of going to the search first. A little query on the database, and there you go.

I can go work on the issue, but is there already somebody working on it? (I'm new to this...just subscribe to the mailing list)
Comment by Dino Krtanjek (KRTac) - Thursday, 01 October 2009, 23:54 GMT
oh....nevermind, I just noticed the task status list on the left of the description...as I said, I'm a n00b.
Comment by Tomas Mudrunka (harvie) - Wednesday, 11 August 2010, 20:23 GMT
Please take a look at  FS#16394 

I think that BASH itself is the best for parsing BASH scripts (such as PKGBUILDs), so we should use BASH code base to parse PKGBUILDs. Problem is that BASH itself is insecure, because it needs to execute the code and we cannot execute untrusted PKGBUILDs on AUR server. I've suggested to use some scripts like this:
http://aur.pastebin.com/B6x1tfV1
with a little bit modified (crippled & secured) BASH (which will not execute anything and will not access the filesystem - i guess this can be reached by some kind of small "lobotomy" commited on original BASH source).

Maybe something also going on at those projects:
- http://github.com/Daenyth/Bashful
- http://github.com/sebnow/aur2
Comment by Gavin Bisesi (Daenyth) - Wednesday, 11 August 2010, 20:25 GMT
As I said in the other bug report, it will not work. I've stated this enough times that I'm not going to bother explaining it yet again. Bashful also is not correct for this, as you'd notice if you read the README file (specifically the "It will not [...] be correct (for potentially large numbers of cases)"). There's also a grand total of 0sloc currently. Consider it vaporware.
Comment by C Anthony Risinger (extofme) - Sunday, 26 September 2010, 20:07 GMT
yes i completely agree with Daenyth; modifying bash/parsing bash is nonsense for several, already touched reasons.

a small, metadata text file is not "bloat"... it's... a small, highly compressible _text file_. in the AUR implementation i am working on [https://bbs.archlinux.org/viewtopic.php?id=99839], SRCINFO will be _required_, as the AUR will run as pure python, as a desktop app, or translated to javascript as a webapp. i am also looking to the future, a distibuted p2p version of AUR, so it has to work on _every_ installed machine. i need a way to get the information i need, in python; text files are a common way to perform this. i'm not even going to attempt trying to parse a pkgbuild.

if anything, SRCINFO could simply be in a common format with high language support such as YAML/etc., this would allow for easy mapping to language constructs.

in my opinion, the entire concept of "parsing bash" is an absolute non-starter, and a complete waste of time; you don't parse code (bash) unless you're a compiler. an easy meta file is the way to go, that can be read by ANY language; this is how we do things in the development world.

that said, i really like the idea of wiki integration; integrate as much as possible.

C Anthony
Comment by unforken (unforken) - Friday, 22 October 2010, 04:21 GMT
meta data generated by makepkg --source command would be nice. website can't parse the PKGBUILD and probably it shouldn't. instead it can use a meta-data generated by makepkg(i.e. pre-parsed)
Comment by Tomas Mudrunka (harvie) - Saturday, 30 October 2010, 14:07 GMT
BTW is there possibility to have some simple workaround, so current AUR will be able to accept split-packages? I think it's more important to be able to upload the split-pkgs to AUR than showing the proper informations. We can just take the first package from split-pkg and ignore the others. So users will be able to share their split-pkgs until this get properly fixed...
Comment by Loui Chang (louipc) - Saturday, 18 December 2010, 17:19 GMT
I don't believe we should try to parse bash PKGBUILDs anymore.
Imagine if pacman had to parse them. Hah. There's a reason packages contain easily parsable metadata.

Let's think of the AUR as a repo, like a pacman repo but for source.
We have done pretty well so far without good metadata, but source packages should no longer be considered second-class.

We should support the idea of including parsable metadata in source packages just as it is in binary
packages. So ultimately I think this is really a makepkg issue. Anyone who is interested in this
problem should open a ticket in the pacman bug tracker and contact their mailing list pacman-dev@archlinux.org
Comment by Tomas Mudrunka (harvie) - Sunday, 19 December 2010, 00:25 GMT
louipc: I still think that if we can't parse PKGBUILD then we should change it's syntax and create PKGBUILDv2 or something like that what will be more parseable (while compatible backwards). I believe that there is way to make new PKGBUILD syntax that will be compatible with bash but easy to parse. makepkg can just force maintainers to keep their PKGBUILDs parseable (eg.: not using advanced bash features/expansions that are not guaranteed to be parseable). We can allow PKGBUILDs to use only limited set of features in the section that we will need to parse.

I have to say that i miss the times when all you needed to upload new package to AUR was single PKGBUILD. it was really KISS architecture and i don't think that we need something more complex for many packages. If we are not able to parse PKGBUILDs then there is problem with parser or with PKGBUILD syntax.
Comment by Loui Chang (louipc) - Sunday, 19 December 2010, 03:25 GMT
Yeah I do agree that something probably needs to be done about the
PKGBUILD spec to make it more friendly for maintaining a database
of source packages. As far as I understand, some of the more
advanced features of PKGBUILDs make it difficult or impossible
to get metadata out of them without building the actual binary
package - even when parsing with bash. The PKGBUILD was obviously
not designed with the source repo in mind. We're kind of trying
to squeeze a square box into a round hole here.

Uploading plain PKGBUILDs was really just for testing purposes
introduced sometime in the middle of the AUR's life. Don't miss
it so much. The AUR was meant to be a repository of tarballed
build scripts.

I would not say that it's KISS to upload packages in inconsistent
formats. Now we can have a little more consistency. We can still
do better there though.

There's obviously a problem with the parser. It's an incomplete
bash parser written in PHP. PKGBUILDs are fully bash scripts.
Therein the problem lies. We can't just use bash itself because
that opens a security hole. Maybe it's possible with a complicated
server setup and a fair amount of resources to do it, but I want to
keep the implementation of the AUR simple. So while it's KISS for
makepkg to allow PKGBUILDs to be bash scripts and run with it, it
makes life hard (not KISS) if you want to maintain a database of
source packages, like the AUR. So lets just throw that buzz word
in the trash.

Anyways, this is not exactly an AUR issue any more in my eyes.
I would like to help solve it though because it would make the
AUR that much better. Unfortunately I am short on free time.
Comment by C Anthony Risinger (extofme) - Sunday, 19 December 2010, 12:02 GMT
yeah, if AUR is to become professional, it needs a reliable way to "exhume" these informations... strangling full-blown bash scripts is not a solution.

i recently realized a large piece of what i want out of a "state manager"...

i think we should encourage a move in pacman to use a proven modeling language like puppet:

http://docs.puppetlabs.com/guides/language_tutorial.html

such a move would retain many expressive qualities, and allow an AST object to manipulate freely in the architecture. several things like git/svn/fetch/move/copy/link/etc. are identical between packages...

the format could include range definitions, offer a means to fetch exact numbers on demand (version HEAD for example).

the point is that bash is only going further into nowhere; i want to make the concept of AUR and load/peer sharing fundamental to the the Arch experience. we need to explore richer core data structures, and use high performance persistent storage backends.

C Anthony
Comment by Xilon (Xilon) - Friday, 21 January 2011, 10:12 GMT
I know Loui is against parsing PKGBUILDs, but I thought I'd mention this anyway. I started a parser quite a while ago for exactly this purpose - pkgparse[1]. I don't think I'll continue to develop it as the bash grammar is just too complex and there will be little benefit from actually parsing it like this.

I agree that a better format is the way to go.

[1] https://github.com/sebnow/pkgparse
Comment by Lukas Fleischer (lfleischer) - Friday, 21 January 2011, 13:42 GMT
Well, the problem is that even if we had a full bash parser, it would still fail to do command substituions (like "pkgname=`echo foo | sed 's/foo/bar/'`") properly. To make that work as well, we would have to either implement functionality of every binary in coreutils, sed, awk, ... (basically every userland package from the "base" group) or execute the actual commands, which we already rejected in the discussion of just sourcing PKGBUILDs. Otherwise it wouldn't be a proper solution and if we implement a solution, which again doesn't cover all cases, we'll be at the same point as now. Furthermore, as a full bash parser including functionality of all "base" packages is definitely not KISS, this feature request is likely to never be implemented at all.
Comment by Xilon (Xilon) - Saturday, 22 January 2011, 03:29 GMT
There could be guidelines against doing that sort of thing (that particular example could be achieved via pure bash). What I was aiming for was effectively a "secure bash interpreter". The interpreter could execute commands and such, but it could be limited in what was allowed (`sed` is fine, but `rm` is not). These are edge cases anyway so I didn't think too much about it. I think it should be possible to do the same thing in bash restricted mode anyway.
Comment by Loui Chang (louipc) - Saturday, 22 January 2011, 11:19 GMT
I'm not exactly sure what bash restricted mode does to be honest,.
Callan and Simo investigated it at some point and they
discovered that it didn't actually restrict anything. They could
still run arbitrary commands.

Here's a reference about restricted mode.
http://www.gnu.org/software/bash/manual/html_node/The-Restricted-Shell.html
Comment by Xilon (Xilon) - Saturday, 22 January 2011, 11:29 GMT
It was the fact that you can still write malicious shell scripts (something that could bring down the server) in restricted mode, but it does restrict what commands you can run (not sure if there's a better way but you can change PATH to a specific dir with the allowed programs). The point of having a customer interpreter was that we could control exactly what kind of behaviour the interpreter has, and have some limitations (don't allow infinite loops for instance). Like I said, it's too much effort for little gain.

Having a proper plain-text metadata format (such as PKGINFO, or JSON/YAML) would be infinitely better. The issue with that is that either makepkg needs to be modified to support this format, or the file needs to be generated from the PKGBUILD, which is an extra step.
Comment by Lukas Fleischer (lfleischer) - Sunday, 23 January 2011, 13:45 GMT
Xilon: Even packages in the official repos use these sort of things and sometimes they're inevitable. There's no way to properly build a perfectly secure environment without implementing most of the functionality of the tools we allow in restricted mode, too. Just some examples to illustrate the problematic nature of allowing a subset of typical commands used in PKGBUILDs (use them at your own risk!):

1. awk(1): Can be used to execute arbitrary commands, e.g. `awk 'BEGIN { system("echo Hello world!") }'`.
2. sed(1): Can be used to overwrite arbitrary files, e.g. `echo '# eval scrapt' | sed 's/a/i/g; 1w /some/other/foobar/script/that/will/be/executed/sometime/later'`.
3. test(1) aka `[`: Can be used to check if a file exists, e.g. `[ -e /some/path/to/a/file/owned/by/some/package ]`.
4. cat(1)/head(1)/tail(1): Can be used to read arbitrary files, e.g. "pkgname=`cat /etc/passwd`", "url=`head -2 /etc/passwd | tail -1`".

Those are just some very simple samples. I can think of much more complex ones that couldn't be detected without implementing full parsers for all commands we allow in restricvted mode as well.
Comment by Manuel Tortosa (manutortosa) - Wednesday, 15 June 2011, 23:40 GMT
One of the issues related to this bug is sending a pkgbuild with a PKGDESC like this pkgdesc="Foo # blah" # bar

i solved this part in CCR like this:

function strip_comments($text) {
$pass1 = preg_replace('/(".*"|\s*#.*|\'.*\'|\(.*\))/','',$text);
if ( $pass1 != "" ) {
$pass2 = str_replace($pass1,"",$text);
$pass3 = preg_replace('/(".*"|\'.*\'|\(.*\))/','',$pass2);
$final = str_replace($pass3,"",$text);
} else {
$final = '';
}
return $final;

is a bit tricky but allow the use of ### inside pkgdesc="" , '' , =("") or =('') ,
eg: pkgdesc=("this is a c# test") #some text ---> pkgdesc=("this is a c# test")
Comment by Manuel Tortosa (manutortosa) - Thursday, 16 June 2011, 19:27 GMT
I reformatted a bit the function:

http://paste.kde.org/83557/
Comment by Lukas Fleischer (lfleischer) - Saturday, 25 June 2011, 08:53 GMT
manutortosa: Sorry, but this is crap. It won't work with multi-line strings, doesn't respect escaping ('foo="\"" #nasty"'), backticks ('`echo foo #nasty`'), comments that contain quotes ('foo # bar "nasty"') and much more...
Comment by Manuel Tortosa (manutortosa) - Saturday, 25 June 2011, 09:05 GMT
Well after this message i tested a lot and in fact is not a good solution, i reduced the usage to a single line and only for the PKGDESC line and only btw "" and '' but still does not cover all the possible situations.
Comment by majiq (majiq) - Saturday, 03 March 2012, 05:52 GMT
Is it possible to create a user with a highly limited quota (e.g. 200 bytes), memory limit, process limit, CPU limit, etc., and then take advantage of parse_pkgbuilds.sh from dbtools?
Comment by Alain Kalker (ackalker) - Saturday, 12 January 2013, 20:01 GMT Comment by Lukas Fleischer (lfleischer) - Sunday, 13 January 2013, 16:05 GMT
Alain: Does it support evaluting inline sed(1), awk(1), etc.? Several packages use that.

In my opinion, there are only two viable solutions:

* Execute PKGBUILDs in a sandbox.
* Provide some kind of meta data with source tarballs.
Comment by Alain Kalker (ackalker) - Monday, 14 January 2013, 02:18 GMT
Lukas: AFAICT it does some (limited) emulation of Bash text substitutions e.g. ${var/srch/repl} and such but no sed or awk sadly.
I'm looking into a way of building a super-restricted Qemu -> really-tiny-kernel -> Bash-and-some-tools-in-initrd setup (no drivers or network at all, just an emulated serial port console to communicate with the outside world.
In keeping with the KISS principle I'm thinking about faking, pretending and cheating just enough to trick makepkg into building the barest skeleton of a package, and extracting the .PKGINFO file out of that.
Any help or suggestions on this would be greatly appreciated.
Comment by Dave Reisner (falconindy) - Monday, 14 January 2013, 02:27 GMT
The proper solution to this is a .SRCINFO file. Involving a VM will mean involving makepkg to do the re-parsing of the PKGBUILD (because of split package overrides), which means it could have just been done at 'makepkg --source' time.
Comment by KaiSforza (KaiSforza) - Wednesday, 16 January 2013, 02:47 GMT
Recently, I was sitting in #archlinux-bugs when this bug caught my eye.
Currently, the AUR parses PKGBUILDS minimally, which causes issues for
split packages. There are three paths I can see the AUR taking:

1) Keep doing what it's doing and don't worry about split PKGBUILDs,
but remain safe if there is malicious code in the PKGBUILD.
2) Parse PKGBUILDS more fully but leave the AUR more open to attack
from malicious code
3) Stop parsing PKGBUILDs on the server and use something akin to
the .PKGINFO file in a package file. Have makepkg generate a
.SRCINFO file (as per Dave Reisner's suggestion) and put that in
the src.* file that is uploaded to the server.

The third option seems to be the best option to me, and would help keep
the AUR server safe. Only one thing that I found when trying to patch
makepkg to generate .SRCINFO files for split packages was that the
variables (such as pkgdesc, provides, depends, etc.) are parsed in the
package functions for the specific packages. I was thinking that the
--source flag should generate a .SRCINFO file in two ways depending on
the PKGBUILD:

split package: make sure that at least the package function is run
and generate $pkgname-$full_version$SRCEXT, instead of using the
$pkgbase. Because the .SRCINFO will contain the pkgbase, these can
be grouped together easily. (Or have a single src package with
.SRCINFO_$pkgname all in one file, then split into separate pages on
the AUR with a single location, like the official package
repositories.)

non-split package: allow --source to be run without packaging each
package, because all variables should be able to be written to the
.SRCINFO file easily

Current packages could remain, and the normal way of submitting packages
would be deprecated as pacman 4.1 is released to [core].
While current pacman development builds could still be submitted, as the
same files would be in the src tarball, but would simply include an
extra file.

Thank you for taking the time to read. If anyone wants me to keep
working on this patch or if someone wants to do this way before my
minimal shell scripting skills fail me horribly.
Comment by Lukas Fleischer (lfleischer) - Tuesday, 26 March 2013, 10:14 GMT
We support .AURINFO since 2.1.0 -- any objections against closing this?
Comment by Remy Marquis (Spyhawk) - Friday, 29 March 2013, 17:35 GMT
Is it possible to use this ".AURINFO" file before it is officially supported by makepkg (and renamed to .SRCINFO)?
Comment by Lukas Fleischer (lfleischer) - Saturday, 30 March 2013, 09:46 GMT
Of course, why not? Just try to add an .AURINFO file to a source tarball and upload it to the AUR.
Comment by Remy Marquis (Spyhawk) - Saturday, 30 March 2013, 11:29 GMT
Sorry, I should have added that I got an error message when doing so: "Invalid name: only lowercase letters are allowed."
This is for a splitted package, with "pkgname=('pkg1' 'pkg2')" in the PKGBUILD and "pkgname = pkg1" in the .AURINFO file.

It seems there is a check done before the PKGBUILD value is overriden. But maybe this is belonging to  FS#16394  instead?
Comment by Lukas Fleischer (lfleischer) - Saturday, 30 March 2013, 12:01 GMT
Could you attach your .AURINFO file, please?
Comment by Remy Marquis (Spyhawk) - Saturday, 30 March 2013, 15:15 GMT
This issue is fixed. There was an ending blank space in the AURINFO ("package = value " instead of "package = value").
This won't happen with automatically created .SRCINFO file by makepkg. Thanks, and sorry for the noise.
Comment by Remy Marquis (Spyhawk) - Saturday, 30 March 2013, 15:43 GMT
Seems I've found a parsing issue. For example, in the involved package (etlegacy):

.AURINFO:
pkgname = etlegacy


PKGBUILD:
pkgname=('etlegacy' 'etlegacy-mod')
pkgdesc="Fully compatible client and server for the popular online FPS game Wolfenstein: Enemy Territory
...
source=($pkgname-linux-$pkgver.tar.gz)

The AUR seems to consider $pkgname as being the second element of the pkgname array (etlegacy-mod), despite being overridden in the AURINFO .file.
So the Sources section of the AUR page shows incorrect filename. Have a look at the AUR page (https://aur.archlinux.org/packages/etlegacy/) and the corresponding PKGBUILD.

Makepkg correctly considers $pkgname as being the first element of the array when compiling the package locally.
Comment by Dominik Heidler (asdil12) - Friday, 03 May 2013, 08:31 GMT
I found one more parsing issue:

When a (pacman-4.1 compliant) git-pkg has a fragment in it's source url, the source is not listen in the webui:
Without fragment:
https://aur.archlinux.org/packages/gnuradio-git/

With fragment:
https://aur.archlinux.org/packages/libosmocore/
Comment by Lukas Fleischer (lfleischer) - Friday, 03 May 2013, 08:55 GMT
asdil12: So we should either add a way to specify sources in .AURINFO or remove the sources list from the AUR web interface. I will not try to fix our PKGBUILD parser.

Loading...