Is it possible to transform one asset into another asset using ops in Dagster?

From what I have found, it is possible to use ops and graphs to generate assets.

However, I would like to use an asset as an input for an op. I am exploring this for the following use case:

  1. I fetch a list of country metadata from an external API and store it in my resource:
import dagster
from typing import Dict, List

@dagster.asset
def country_metadata_asset() -> List[Dict]:
    ...
  2. I use this asset to define some downstream assets, for example:
@dagster.asset
def country_names_asset(country_metadata_asset) -> List[str]:
    ...
  3. I would like to use this asset to call another data source to retrieve and validate data, and then write the result to my resource. The call returns a huge number of rows, so I need to process them in batches, and I thought that a graph with ops would be a better fit for this. I thought of doing something like the following:
@dagster.op(out=dagster.DynamicOut())
def load_country_names(country_names_asset):
    for country_index, country_name in enumerate(country_names_asset):
        yield dagster.DynamicOutput(
            country_name, mapping_key=f"{country_index} {country_name}"
        )

@dagster.graph()
def update_data_graph():
    country_names = load_country_names()
    country_names.map(retrieve_and_process_data)


@dagster.job()
def run_update_job():
    update_data_graph()

It seems that my approach does not work, and I am not sure if it is conceptually correct. My questions are:

  1. How do I tell Dagster that the input for load_country_names is an asset? Should I materialise it manually inside the op?

  2. How do I efficiently write the augmented data returned from retrieve_and_process_data to my resource? It is not possible to keep all of the data in memory, so I thought of implementing this with a custom IOManager, but I am not sure how to do it.


Answer

It seems to me like the augmented data that’s returned from retrieve_and_process_data can (at least in theory) be represented by an asset.

So we can start from the standpoint that we’d like to create some asset that takes in country_names_asset, as well as the source data asset (the thing that has a bunch of rows in it, which we can call big_country_data_asset for now). I think this models the underlying relationships a bit better, independent of how we’re actually implementing things.

The question then is how to write the computation function for this asset in a way that doesn’t require loading the entire contents of big_country_data_asset into memory at any point in time. While it’s possible that you could do this with a dynamic graph, which you then wrap in a call to AssetsDefinition.from_graph, I think there’s an easier approach.
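
For reference, that graph-based route could look roughly like the sketch below. It is an assumption about how the pieces would need to fit together (not the approach recommended here): the graph would declare country_names_asset as an input, return the collected results of the mapped op, and AssetsDefinition.from_graph would map that input onto the existing asset key.

from dagster import AssetKey, AssetsDefinition, graph

# sketch only: load_country_names and retrieve_and_process_data are the ops
# from the question; this assumes retrieve_and_process_data persists its rows
# itself and returns only a small per-country summary
@graph
def update_data_graph(country_names_asset):
    country_names = load_country_names(country_names_asset)
    return country_names.map(retrieve_and_process_data).collect()

update_data_asset = AssetsDefinition.from_graph(
    update_data_graph,
    keys_by_input_name={"country_names_asset": AssetKey("country_names_asset")},
)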

Dagster allows you to circumvent the IOManager machinery both when reading an asset as input and when writing an asset as output. In essence, when you set an AssetKey as a non_argument_dep, this tells Dagster that there is some asset upstream of the asset you’re defining, but that it will be loaded within the body of the asset function (rather than being loaded by Dagster using the IOManager machinery).

Similarly, if you set the output type of the function to None, this tells Dagster that the asset you’re defining will be persisted by the logic inside of the function, rather than by an IOManager.

Using both of these concepts, we can write an asset that never needs to load the entire big_country_data_asset into memory at once:

from dagster import AssetKey, asset

@asset(non_argument_deps={AssetKey("big_country_data_asset")})
def processed_country_data_asset(country_names_asset) -> None:
    for name in country_names_asset:
        # assuming this function actually stores data somewhere,
        # and intrinsically knows how to read from big_country_data_asset
        retrieve_and_process_data(name)
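
For concreteness, a hypothetical retrieve_and_process_data might stream one country’s rows in chunks so that nothing large ever sits in memory. This is only a sketch; stream_source_rows, validate_row, and write_rows_to_destination are made-up helpers standing in for your own source and destination logic.

def retrieve_and_process_data(country_name: str, batch_size: int = 10_000) -> None:
    # made-up helper that yields batches of rows for one country
    # from big_country_data_asset's underlying storage
    for batch in stream_source_rows(country_name, batch_size=batch_size):
        validated = [validate_row(row) for row in batch]  # made-up validation helper
        write_rows_to_destination(country_name, validated)  # made-up persistence helper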

IOManagers are a very flexible concept, however, and it is possible to replicate this same batching behavior while using them (it is just a bit more convoluted). You’d need to do something like create a SourceAsset(key="big_country_data_asset", io_manager_def=my_custom_io_manager), where my_custom_io_manager has a somewhat unusual load_input function that itself returns a function, like:

# load_input on the custom IOManager (self omitted here for brevity)
def load_input(context):
    def _fn(country_name):
        # however you actually get these rows
        rows = query_source_data_for_name(country_name)
        return rows

    return _fn
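
For completeness, here is a rough sketch of how that load_input could be wired up with a custom IOManager and a SourceAsset. The class name and query_source_data_for_name are placeholders; SourceAsset and the io_manager decorator are standard Dagster constructs.

from dagster import IOManager, SourceAsset, io_manager

class BigCountryDataIOManager(IOManager):
    def handle_output(self, context, obj):
        # the source asset is only ever read by Dagster, never written
        raise NotImplementedError

    def load_input(self, context):
        # hand downstream assets a function rather than the data itself
        def _fn(country_name):
            return query_source_data_for_name(country_name)  # placeholder helper

        return _fn

@io_manager
def my_custom_io_manager(_):
    return BigCountryDataIOManager()

big_country_data_asset = SourceAsset(
    key="big_country_data_asset", io_manager_def=my_custom_io_manager
)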

Then you could define your asset like:

@asset
def processed_country_data_asset(
    country_names_asset, big_country_data_asset
) -> None:
    for name in country_names_asset:
        # big_country_data_asset has been loaded as a function
        rows = big_country_data_asset(name)
        process_data(rows)

You can also handle writing the output of this function in an IOManager using a similar-looking trick: https://github.com/dagster-io/dagster/discussions/9772
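
The linked discussion has the details, but roughly speaking, one possible shape of the output side is sketched below. It assumes the asset returns an iterator of row batches rather than a fully materialized list, and the read/write helpers are made up; the exact approach in the linked discussion may differ.

from dagster import IOManager, io_manager

class BatchWritingIOManager(IOManager):
    def handle_output(self, context, obj):
        # obj is assumed to be an iterator of row batches produced by the asset
        for batch in obj:
            write_rows_to_destination(context.asset_key, batch)  # made-up helper

    def load_input(self, context):
        # made-up helper that streams the stored rows back out for downstream assets
        return read_rows_from_destination(context.asset_key)

@io_manager
def batch_writing_io_manager(_):
    return BatchWritingIOManager()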
