
Elasticsearch Bulk insert w/ Python – socket timeout error

ElasticSearch 7.10.2

Python 3.8.5

elasticsearch-py 7.12.1

I’m trying to bulk insert 100,000 records into Elasticsearch using the elasticsearch-py bulk helper.

Here is the Python code:

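In outline, it is something like the sketch below; the host, index name, file path, and config values are placeholders rather than the exact ones from the original script:

```python
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

# Placeholder values standing in for the question's configuration variables
ES_HOST = "localhost"
ES_PORT = 9200
INDEX_NAME = "my-index"
JSON_FILE = "documents.json"   # one JSON document per line, ~100k lines
CHUNK_SIZE = 10000
MAX_INSERT_RETRIES = 3

es = Elasticsearch(
    [{"host": ES_HOST, "port": ES_PORT}],
    retry_on_timeout=True,
)

def generate_actions():
    """Yield one index action per line of the input file."""
    with open(JSON_FILE) as f:
        for line in f:
            yield {"_index": INDEX_NAME, "_source": json.loads(line)}

for ok, result in streaming_bulk(
    es,
    generate_actions(),
    chunk_size=CHUNK_SIZE,
    max_retries=MAX_INSERT_RETRIES,
):
    if not ok:
        print("Failed to index document:", result)
```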

When the JSON file contains a small number of documents (~100), this code runs without issue. But when I tested it with a file of 100k documents, I got this error:

(large error traceback omitted; it ends in a socket timeout error)

I have to admit this one is a bit over my head. I don’t typically like to paste large error messages here, but I’m not sure what about this message is relevant.

I can’t help but think that I need to adjust some of the params on the es object, or maybe the configuration variables? I don’t know enough about them to make an educated decision on my own.

And last but certainly not least: it looks like some documents were loaded into the ES index nonetheless. Even stranger, the count shows 110k documents when the JSON file only has 100k.
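
For reference, a quick way to check that count (the index name is a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])

# Ask Elasticsearch how many documents the index currently holds
print(es.count(index="my-index")["count"])
```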


Answer

TL;DR:

Reduce the chunk_size from 10000 to the default of 500 and I’d expect it to work. You probably want to disable the automatic retries if they can give you duplicates.

What happened?

You specified chunk_size=10000, which means the streaming_bulk call tries to insert chunks of 10,000 documents at a time. The connection to Elasticsearch has a configurable timeout, which is 10 seconds by default. So, if your Elasticsearch server takes more than 10 seconds to process the 10,000 documents you want to insert, a timeout happens and is handled as an error.
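
For illustration, these are the client-level settings involved (the host and values below are placeholders; timeout=10 just spells out the default):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(
    [{"host": "localhost", "port": 9200}],
    timeout=10,             # per-request timeout in seconds; 10 is the default
    retry_on_timeout=True,  # retry the request when it times out
)
```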

When creating your Elasticsearch object, you also specified retry_on_timeout=True, and in the streaming_bulk call you set max_retries=max_insert_retries, which is 3.

This means that when such a timeout happens, the library will retry up to 3 times; if the insert still times out after that, it gives you the error you noticed. (Documentation)

Also, when the timeout happens, the library cannot know whether the documents were inserted successfully, so it has to assume they were not. Thus, it will try to insert the same documents again. I don’t know what your input lines look like, but if they do not contain an _id field, this would create duplicates in your index. You probably want to prevent this, either by adding some kind of _id, or by disabling the automatic retry and handling it manually.
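
As a sketch of the first option, you can set an explicit _id on each action so a retried document overwrites itself instead of being indexed a second time. This reuses the placeholder names from the earlier sketch and assumes each document carries some unique field, here called id hypothetically:

```python
def generate_actions():
    """Yield index actions with a stable _id so retries are idempotent."""
    with open(JSON_FILE) as f:
        for line in f:
            doc = json.loads(line)
            yield {
                "_index": INDEX_NAME,
                "_id": doc["id"],   # hypothetical unique field in each document
                "_source": doc,
            }
```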

What to do?

There are two ways you can go about this:

  • Increase the timeout
  • Reduce the chunk_size

streaming_bulk has chunk_size set to 500 by default. Your 10000 is much higher. I wouldn’t expect a big performance gain from increasing it beyond 500, so I’d advise you to just use the default of 500 here. If 500 still fails with a timeout, you may even want to reduce it further. This could happen if the documents you want to index are very complex.

You could also increase the timeout for the streaming_bulk call, or, alternatively, for your es object. To only change it for the streaming_bulk call, you can provide the request_timeout keyword argument:

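For example, reusing the placeholder names from the sketch above (the 60-second value is arbitrary):

```python
for ok, result in streaming_bulk(
    es,
    generate_actions(),
    chunk_size=500,
    max_retries=MAX_INSERT_RETRIES,
    request_timeout=60,   # seconds; applies only to these bulk requests
):
    if not ok:
        print("Failed to index document:", result)
```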

However, this also means that an Elasticsearch node failure will only be detected after this higher timeout. See the documentation for more details.
