Is it possible to download specific files from maven repository using python?

bog0alt · April 26, 2023, 7:57am

Is it possible to search for and download specific files (pom or jar) using Python?
For example:

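        import requests

        url = "https://search.maven.org/solrsearch/select"
        # search terms are combined with AND; wt=json asks for a JSON response
        src = "g:javax.servlet AND a:javax.servlet-api AND v:4.0.1"

        response = requests.get(url, params={"q": src, "wt": "json"})
        result = response.json()["response"]
        match = result["docs"][0]
        print(match)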

This prints some information, but no links to files.
Any ideas?

mfrost · May 1, 2023, 7:36pm

Hi there @bog0alt! I spoke to our team about this and this is what they said.

Search.maven.org is not a repository for files - it serves up metadata about components, like their Maven coordinates. This is why links to files aren’t included in search results. Maven coordinates CAN be translated into file locations/URLs on Maven Central itself.

In the case of your example, the file download link would be: https://repo1.maven.org/maven2/javax/servlet/javax.servlet-api/4.0.1/javax.servlet-api-4.0.1.jar
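
As a rough sketch of how a search result could be turned into links like the one above: the function below takes one document from the search response and builds a URL per file suffix. It assumes the document carries the keys "g", "a" and "v", plus an "ec" list of file suffixes such as ".jar" and ".pom"; the function name `doc_to_urls` and the sample document are made up for illustration, so check the field names against a real response before relying on them.

```python
def doc_to_urls(doc, base="https://repo1.maven.org/maven2"):
    """Build candidate download URLs from one search result document.

    Assumption: the document has "g", "a" and "v" keys, and optionally an
    "ec" list of file suffixes (for example ".jar" or ".pom").
    """
    group_path = doc["g"].replace(".", "/")   # dots in the groupId become slashes
    artifact, version = doc["a"], doc["v"]
    suffixes = doc.get("ec", [".pom"])        # fall back to the POM if "ec" is missing
    return [
        f"{base}/{group_path}/{artifact}/{version}/{artifact}-{version}{suffix}"
        for suffix in suffixes
    ]


# Hand-written sample shaped like a search result; not real API output.
sample_doc = {
    "g": "javax.servlet",
    "a": "javax.servlet-api",
    "v": "4.0.1",
    "ec": [".jar", ".pom"],
}
urls = doc_to_urls(sample_doc)
for url in urls:
    print(url)
```

With a live response, `match` from the snippet in the question would be passed in place of `sample_doc`.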

If you have a set of coordinates you’re searching for, there may be Python libraries or tools out there that are designed for downloading files from Maven Central given a set of Maven coordinates.

If you have any questions, please let me know.

rseddon · May 1, 2023, 8:59pm

You can also use Maven to download it:

mvn dependency:copy -Dartifact=javax.servlet:javax.servlet-api:4.0.1 -DoutputDirectory=/some/directory

Rich

bog0alt · May 3, 2023, 6:51am

Hello, thanks for helping.
I cannot use mvn because the code must be self-contained (so no external dependencies like mvn).

bog0alt · May 3, 2023, 7:05am

Thanks for involving your team
Just a question: is it a standard path, or may it differ from one package to another? I mean: how do I guess the link?
Thanks

mfrost · May 3, 2023, 3:31pm

@bog0alt There is a standard way to translate coordinates to paths.

At a high level it looks like this: for groupId:artifactId:version, you would first take the groupId and replace all the dots with slashes, and then the URL path on Maven Central looks like https://repo1.maven.org/maven2/groupId/artifactId/version/artifactId-version.jar (or .pom or whatever other extension you might need).

In the example provided, the coordinates are javax.servlet:javax.servlet-api:4.0.1, so javax.servlet turns into javax/servlet, and putting everything together gives a final path that looks like: https://repo1.maven.org/maven2/javax/servlet/javax.servlet-api/4.0.1/javax.servlet-api-4.0.1.jar
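
The rule described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `coordinates_to_url` and its parameters are made up here, and it handles plain groupId:artifactId:version coordinates only (no classifiers or other variants).

```python
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def coordinates_to_url(coordinates, extension="jar", base=MAVEN_CENTRAL):
    """Translate "groupId:artifactId:version" into a repository URL."""
    parts = coordinates.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected groupId:artifactId:version, got {coordinates!r}")
    group_id, artifact_id, version = parts

    group_path = group_id.replace(".", "/")            # only the groupId is split on dots
    filename = f"{artifact_id}-{version}.{extension}"  # artifactId keeps its dots
    return "/".join([base, group_path, artifact_id, version, filename])


print(coordinates_to_url("javax.servlet:javax.servlet-api:4.0.1"))
print(coordinates_to_url("javax.servlet:javax.servlet-api:4.0.1", extension="pom"))
```

Note that the dots in the artifactId (javax.servlet-api) stay as they are; only the groupId is converted to a directory path.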

They recommend not solving this from scratch but looking for an existing library that already does it. They also noted that the full Maven codebase is open source, so porting the conversion process that Maven itself performs from Java to Python is a possibility.

bog0alt · May 4, 2023, 8:10am

Thanks for your help.
I found this code works for me:

import requests

class Downloader:
    def __init__(self):
        self.base = "https://repo1.maven.org/maven2/"

    def download(self, g, a, v, extension="pom"):  # "pom" or "jar"
        '''
        This method manages the different parts of the downloading process
        :param g: groupID
        :param a: artefactID
        :param v: artefact version
        :param extension: set pom if you want to download the pom file,
                          set jar if you want to download the jar file
        :return: name of the downloaded file
        '''

        self.g = g
        self.a = a
        self.v = v

        url = self.gav_to_url(g, a, v, extension)   # create the url from the GAV format
        if not self.is_downloadable(url):
            print(f"The url = {url} is not a downloadable URL")
            raise SystemExit(1)                     # same effect as exit(1)
        return self.perform_download(url)           # return the name of the downloaded file

    def gav_to_url(self, g, a, v, ext):
        '''
        This method creates the JAR or POM file link from G:A:V coordinates
        :param g: groupID
        :param a: artefactID
        :param v: artefact version
        :param ext: set pom if you want to create a link for the pom file,
                    set jar if you want to create a link for the jar file
        :return: url pointing to the desired file: jar or pom
        '''
        
        gid = g.replace(".", "/")
        return self.base + gid + "/" + a + "/" + v + "/" + a + "-" + v + "." + ext

    def perform_download(self, url):
        '''
        Download the file and save it in the current directory
        :param url: url of the file to be downloaded
        :return: name of the downloaded file
        '''

        filename = url.split("/")[-1]
        try:
            response = requests.get(url, allow_redirects=True)
            response.raise_for_status()             # treat HTTP errors (404, 500, ...) as failures
            with open(filename, "wb") as f:         # overwrites the file if it already exists
                f.write(response.content)
        except (requests.RequestException, OSError) as e:
            print(f"Something went wrong while downloading {url}; the following exception was raised: {e}. "
                  f"This operation is mandatory, exiting!")
            raise SystemExit(1)
        return filename

    def is_downloadable(self, url):
        """
        Does the url point to a downloadable resource? Only the response headers are examined.
        I am aiming for a POM or JAR file; other files will be ignored.
        :param url: url of the file to be checked
        :return: True if the target file is actually a POM or JAR file, False otherwise
        """

        h = requests.head(url, allow_redirects=True)
        content_type = (h.headers.get('content-type') or "").lower()   # the header may be missing
        print("Content Type = ", content_type)

        if "text/xml" in content_type or "application/java-archive" in content_type:
            return True
        else:
            print(f"this url points neither to a POM file nor to a JAR file but to a {content_type}")
            return False

It may not be perfect, but I hope it is a good starting point for those who need this.
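
One possible refinement, sketched under some assumptions: instead of trusting the Content-Type header, the downloaded bytes can be compared against the SHA-1 checksum file that Maven Central publishes next to each artifact (the artifact URL with ".sha1" appended). The helper names below are made up for illustration, and the sketch uses only the standard library (`urllib.request`, `hashlib`) to stay self-contained. `download_verified` needs network access and is not called here.

```python
import hashlib
import urllib.request


def parse_sha1_file(text):
    """Return the hex digest from the content of a .sha1 file.

    Some checksum files carry extra text after the digest, so only the
    first whitespace-separated token is used.
    """
    return text.split()[0].lower()


def sha1_matches(data, expected_hex):
    """True if the SHA-1 digest of data equals expected_hex (case-insensitive)."""
    return hashlib.sha1(data).hexdigest() == expected_hex.lower()


def download_verified(url, timeout=30):
    """Download url and its .sha1 companion; raise ValueError on a mismatch."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    with urllib.request.urlopen(url + ".sha1", timeout=timeout) as response:
        expected = parse_sha1_file(response.read().decode("ascii"))
    if not sha1_matches(data, expected):
        raise ValueError(f"SHA-1 mismatch for {url}")
    return data
```

The returned bytes could then be written to disk the same way `perform_download` does.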