- How to extract suffixes from filenames with Python
- Defining a guess_extensions(filename) utility function that extracts extensions from a filename
Recently I was dealing with filenames of the following form: 3a1d3614.2020-09-27.json.gz. That is, the filenames consisted of four parts separated by dots: an alphanumeric ID, a date, and two extensions.
Python offers at least two built-in ways to extract extensions from filenames. The os.path.splitext(path) function splits a path or filename into a pair (root, ext):
>>> from os.path import splitext
>>> filename = "3a1d3614.2020-09-27.json.gz"
>>> splitext(filename)
('3a1d3614.2020-09-27.json', '.gz')
As you can see, this gives us only the last of the two extensions. You could proceed to split the root part repeatedly and collect the extensions, but you wouldn’t know when to stop. This is because splitext doesn’t check whether an extracted extension represents an actual file type extension such as .jpg. Splitting twice more returns the date as if it were an extension:
>>> splitext(splitext(splitext(filename)[0])[0])
('3a1d3614', '.2020-09-27')
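To make the "you wouldn't know when to stop" problem concrete, here is a minimal sketch of the repeated-splitting approach. The helper name split_all is made up for illustration. It simply keeps calling splitext until no suffix is left, so the date ends up in the result alongside the real extensions.

```python
from os.path import splitext


def split_all(filename):
    """Repeatedly apply splitext and collect every dot-separated suffix.

    Hypothetical helper for illustration: it has no notion of what a
    "real" extension is, so it only stops when splitext finds no suffix.
    """
    suffixes = []
    root, ext = splitext(filename)
    while ext:
        suffixes.insert(0, ext)  # prepend to keep left-to-right order
        root, ext = splitext(root)
    return root, suffixes


root, suffixes = split_all("3a1d3614.2020-09-27.json.gz")
print(root, suffixes)
```

For the filename above, the collected list contains '.2020-09-27' as well as '.json' and '.gz', which is exactly the unwanted behavior described here.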
An essentially equivalent behavior can be achieved by using the pathlib.Path class (available in Python 3.4 and newer):
>>> from pathlib import Path
>>> Path(filename).suffixes
['.2020-09-27', '.json', '.gz']
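For comparison, the following sketch puts the related Path attributes side by side. The variable names are illustrative only, and the slicing step used to recover the ID is just one possible way to strip all suffixes at once.

```python
from pathlib import Path

path = Path("3a1d3614.2020-09-27.json.gz")

last_suffix = path.suffix      # only the final suffix
all_suffixes = path.suffixes   # every dot-separated suffix, date included
stem = path.stem               # the name without the final suffix

# Strip all suffixes at once to recover the leading ID part.
base = path.name[: len(path.name) - len("".join(all_suffixes))]

print(last_suffix, all_suffixes, stem, base)
```

Like splitext, none of these attributes distinguishes the date from a genuine file type extension.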
Although more succinct than splitext, Path.suffixes still includes the date part of the filename. In order to exclude this part and other non-extensions from the suffixes list, we can filter it with the help of the mimetypes module. This module features a types_map dictionary that maps the file extensions known to Python to their MIME types:
>>> import mimetypes
>>> mimetypes.types_map
{'.js': 'application/javascript', '.mjs': 'application/javascript', '.json': 'application/json', …}
We can use this map to determine whether Python considers a suffix a file type extension or not. Thus, we arrive at a nice & simple utility function that gives us what we were looking for:
import mimetypes
from pathlib import Path
def guess_extensions(filename):
    return [s for s in Path(filename).suffixes if s in mimetypes.types_map]
>>> guess_extensions(filename)
['.json', '.gz']
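One caveat: the contents of types_map can differ between systems, because mimetypes may also load MIME type files provided by the operating system. Compression suffixes such as .gz are listed in the module's separate encodings_map, so on some installations .gz may be missing from types_map and the function above would return only ['.json']. The sketch below is a hedged variant that consults both maps. The name guess_extensions_portable and the lowercasing step are my own additions, not part of the function above.

```python
import mimetypes
from pathlib import Path


def guess_extensions_portable(filename):
    """Return the suffixes that mimetypes knows as a type or an encoding.

    Hypothetical variant: checks both types_map and encodings_map, and
    compares case-insensitively so that '.JSON' is treated like '.json'.
    """
    if not mimetypes.inited:
        mimetypes.init()  # load the default and system-provided tables once
    known = set(mimetypes.types_map) | set(mimetypes.encodings_map)
    return [s for s in Path(filename).suffixes if s.lower() in known]


print(guess_extensions_portable("3a1d3614.2020-09-27.json.gz"))
```

The trade-off is that encodings_map also admits suffixes such as .bz2 and .xz, which is usually what you want for compressed files but widens what counts as an extension.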