
Elasticsearch Bulk insert w/ Python – socket timeout error

ElasticSearch 7.10.2

Python 3.8.5

elasticsearch-py 7.12.1

I’m trying to bulk insert 100,000 records into Elasticsearch using the elasticsearch-py bulk helper.

Here is the Python code:

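In outline, it is something like the sketch below; the host, index name, file path, and config values are placeholders rather than the exact ones from the original script:

```python
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

# Placeholder values standing in for the question's configuration variables
ES_HOST = "localhost"
ES_PORT = 9200
INDEX_NAME = "my-index"
JSON_FILE = "documents.json"   # one JSON document per line, ~100k lines
CHUNK_SIZE = 10000
MAX_INSERT_RETRIES = 3

es = Elasticsearch(
    [{"host": ES_HOST, "port": ES_PORT}],
    retry_on_timeout=True,
)

def generate_actions():
    """Yield one index action per line of the input file."""
    with open(JSON_FILE) as f:
        for line in f:
            yield {"_index": INDEX_NAME, "_source": json.loads(line)}

for ok, result in streaming_bulk(
    es,
    generate_actions(),
    chunk_size=CHUNK_SIZE,
    max_retries=MAX_INSERT_RETRIES,
):
    if not ok:
        print("Failed to index document:", result)
```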

When the JSON file contains a small number of documents (~100), this code runs without issue. But when I tested it with a file of 100k documents, I got this error:

(large error traceback omitted; it ends in a socket timeout error)

I have to admit this one is a bit over my head. I don’t typically like to paste large error messages here, but I’m not sure what about this message is relevant.

I can’t help but think that I need to adjust some of the params on the es object, or maybe the configuration variables? I don’t know enough about them to make an educated decision on my own.

And last but certainly not least: it looks like some documents were loaded into the ES index nonetheless. Even stranger, the count shows 110k documents when the JSON file only has 100k.
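
For reference, a quick way to check that count (the index name is a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])

# Ask Elasticsearch how many documents the index currently holds
print(es.count(index="my-index")["count"])
```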


Answer

TL;DR:

Reduce the chunk_size from 10000 to the default of 500 and I’d expect it to work. You probably want to disable the automatic retries if they can give you duplicates.

What happened?

You specified chunk_size=10000, which means the streaming_bulk call tries to insert chunks of 10,000 documents at a time. The connection to Elasticsearch has a configurable timeout, which is 10 seconds by default. So, if your Elasticsearch server takes more than 10 seconds to process the 10,000 documents you want to insert, a timeout happens and is handled as an error.
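
For illustration, these are the client-level settings involved (the host and values below are placeholders; timeout=10 just spells out the default):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(
    [{"host": "localhost", "port": 9200}],
    timeout=10,             # per-request timeout in seconds; 10 is the default
    retry_on_timeout=True,  # retry the request when it times out
)
```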

When creating your Elasticsearch object, you also specified retry_on_timeout=True, and in the streaming_bulk call you set max_retries=max_insert_retries, which is 3.

This means that when such a timeout happens, the library will retry up to 3 times; if the insert still times out after that, it gives you the error you noticed. (Documentation)

Also, when the timeout happens, the library cannot know whether the documents were inserted successfully, so it has to assume they were not. Thus, it will try to insert the same documents again. I don’t know what your input lines look like, but if they do not contain an _id field, this would create duplicates in your index. You probably want to prevent this, either by adding some kind of _id, or by disabling the automatic retry and handling it manually.
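
As a sketch of the first option, you can set an explicit _id on each action so a retried document overwrites itself instead of being indexed a second time. This reuses the placeholder names from the earlier sketch and assumes each document carries some unique field, here called id hypothetically:

```python
def generate_actions():
    """Yield index actions with a stable _id so retries are idempotent."""
    with open(JSON_FILE) as f:
        for line in f:
            doc = json.loads(line)
            yield {
                "_index": INDEX_NAME,
                "_id": doc["id"],   # hypothetical unique field in each document
                "_source": doc,
            }
```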

What to do?

There are two ways you can go about this:

  • Increase the timeout
  • Reduce the chunk_size

streaming_bulk has chunk_size set to 500 by default. Your 10000 is much higher. I wouldn’t expect a big performance gain from increasing it beyond 500, so I’d advise you to just use the default of 500 here. If 500 still fails with a timeout, you may even want to reduce it further. This could happen if the documents you want to index are very complex.

You could also increase the timeout for the streaming_bulk call, or, alternatively, for your es object. To only change it for the streaming_bulk call, you can provide the request_timeout keyword argument:

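For example, reusing the placeholder names from the sketch above (the 60-second value is arbitrary):

```python
for ok, result in streaming_bulk(
    es,
    generate_actions(),
    chunk_size=500,
    max_retries=MAX_INSERT_RETRIES,
    request_timeout=60,   # seconds; applies only to these bulk requests
):
    if not ok:
        print("Failed to index document:", result)
```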

However, this also means that an Elasticsearch node failure will only be detected after this higher timeout. See the documentation for more details.
