I’m trying to execute the following code in an Azure ML Studio notebook:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import KFold, cross_val_score

for C in np.linspace(0.01, 0.2, 30):
    cv = KFold(n=X_train.shape[0], n_folds=7, shuffle=True, random_state=12345)
    clf = LogisticRegression(C=C, random_state=12345)
    print C, sum(cross_val_score(clf, X_train_scaled, y_train, scoring='roc_auc', cv=cv, n_jobs=2)) / 7.0
```
and I’m getting this error:
```
Failed to save <type 'numpy.ndarray'> to .npy file:
Traceback (most recent call last):
  File "/home/nbcommon/env/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 271, in save
    obj, filename = self._write_array(obj, filename)
  File "/home/nbcommon/env/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 231, in _write_array
    self.np.save(filename, array)
  File "/home/nbcommon/env/lib/python2.7/site-packages/numpy/lib/npyio.py", line 491, in save
    pickle_kwargs=pickle_kwargs)
  File "/home/nbcommon/env/lib/python2.7/site-packages/numpy/lib/format.py", line 585, in write_array
    array.tofile(fp)
IOError: 19834920 requested and 8384502 written
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-29-9740e9942629> in <module>()
      6     cv = KFold(n=X_train.shape[0], n_folds=7, shuffle=True, random_state=12345)
      7     clf = LogisticRegression(C=C, random_state=12345)
----> 8     print C, sum(cross_val_score(clf, X_train_scaled, y_train, scoring='roc_auc', cv=cv, n_jobs=2)) / 7.0

/home/nbcommon/env/lib/python2.7/site-packages/sklearn/cross_validation.pyc in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
   1431                                   train, test, verbose, None,
   1432                                   fit_params)
-> 1433                   for train, test in cv)
   1434     return np.array(scores)[:, 0]
   1435

/home/nbcommon/env/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
    808                 # consumption.
    809                 self._iterating = False
--> 810             self.retrieve()
    811             # Make sure that we get a last message telling us we are done
    812             elapsed_time = time.time() - self._start_time

/home/nbcommon/env/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in retrieve(self)
    725             job = self._jobs.pop(0)
    726             try:
--> 727                 self._output.extend(job.get())
    728             except tuple(self.exceptions) as exception:
    729                 # Stop dispatching any new job in the async callback thread

/home/nbcommon/env/lib/python2.7/multiprocessing/pool.pyc in get(self, timeout)
    565             return self._value
    566         else:
--> 567             raise self._value
    568
    569     def _set(self, i, obj):

IOError: [Errno 28] No space left on device
```
With `n_jobs=1` it works fine.
I think this is because the `joblib` library tries to save my data to `/dev/shm`. The problem is that it has only 64M of capacity:
```
Filesystem         Size  Used Avail Use% Mounted on
none               786G  111G  636G  15% /
tmpfs               56G     0   56G   0% /dev
shm                 64M     0   64M   0% /dev/shm
tmpfs               56G     0   56G   0% /sys/fs/cgroup
/dev/mapper/crypt  786G  111G  636G  15% /etc/hosts
```
I can’t change this folder by setting the `JOBLIB_TEMP_FOLDER` environment variable (`export` doesn’t work).
The array itself is about 150 MB:

```
In [35]: X_train_scaled.nbytes
Out[35]: 158679360
```
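A quick way to see the mismatch from inside the notebook is something like this (just a sketch using the standard library, with `X_train_scaled` being the array from the code above):

```python
import os

# Sketch: compare free space on /dev/shm with the size of the array that
# joblib will memmap for the worker processes (X_train_scaled from above).
stat = os.statvfs('/dev/shm')
print 'free on /dev/shm:', stat.f_bavail * stat.f_frsize
print 'array size:      ', X_train_scaled.nbytes
```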
Thanks for any advice!
Answer
`/dev/shm` is a virtual filesystem for passing data between programs; it is the Linux implementation of traditional shared memory. So you cannot increase its size by setting options at the application level.
You can, however, remount `/dev/shm` with a larger size (for example 8G) from a Linux shell with administrator (root) permission, as follows:
```
mount -o remount,size=8G /dev/shm
```
However, it seems that Azure ML Studio does not support remote access via SSH, so the feasible plan is to upgrade to the standard tier if you are using the free tier at present.
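As a possible in-notebook alternative (a sketch only, not verified on Azure ML Studio): joblib also reads the `JOBLIB_TEMP_FOLDER` environment variable mentioned in the question, so setting it from Python before the loop may redirect the memmapped arrays to a larger filesystem such as `/tmp`:

```python
import os

# Sketch (not verified on Azure ML Studio): redirect joblib's temporary
# memmapped arrays away from the 64M /dev/shm to a larger filesystem.
# '/tmp' is only a guess at a bigger writable location on this VM.
os.environ['JOBLIB_TEMP_FOLDER'] = '/tmp'
```

If the variable is not picked up by the worker processes, keeping `n_jobs=1` remains the working fallback mentioned in the question.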