I’m new to mrjob and EMR and still trying to figure out how things work. I’m getting the following error when I run my script:
python3 MovieSimilarities.py -r emr --items=ml-100k/u.item ml-100k/u.data > sims2t.txt
No configs found; falling back on auto-configuration
No configs specified for emr runner
Using s3://mrjob-35beccaf67be4929/tmp/ as our temp dir on S3
Creating temp directory /tmp/MovieSimilarities.hostname.20201101.164744.518416
uploading working dir files to s3://mrjob-35beccaf67be4929/tmp/MovieSimilarities.hostname.20201101.164744.518416/files/wd...
Copying other local files to s3://mrjob-35beccaf67be4929/tmp/MovieSimilarities.hostname.20201101.164744.518416/files/
Created new cluster j-320TQKHQJ683U
Added EMR tags to cluster j-320TQKHQJ683U: __mrjob_label=MovieSimilarities, __mrjob_owner=hostname, __mrjob_version=0.7.4
Waiting for Step 1 of 3 (s-1WHEBVTU60KAA) to complete
PENDING (cluster is STARTING)
PENDING (cluster is STARTING)
PENDING (cluster is STARTING: Configuring cluster software)
PENDING (cluster is STARTING: Configuring cluster software)
PENDING (cluster is STARTING: Configuring cluster software)
PENDING (cluster is STARTING: Configuring cluster software)
PENDING (cluster is STARTING: Configuring cluster software)
master node is ec2-44-234-63-159.us-west-2.compute.amazonaws.com
PENDING (cluster is RUNNING: Running step)
RUNNING for 0:00:52
COMPLETED
Attempting to fetch counters from logs
Waiting for cluster (j-320TQKHQJ683U) to terminate
TERMINATING
TERMINATED
Looking for step log in s3://mrjob-35beccaf67be4929/tmp/logs/j-320TQKHQJ683U/steps/s-1WHEBVTU60KAA...
Parsing step log: s3://mrjob-35beccaf67be4929/tmp/logs/j-320TQKHQJ683U/steps/s-1WHEBVTU60KAA/syslog.gz
Counters: 60
File Input Format Counters
Bytes Read=1994689
File Output Format Counters
Bytes Written=1397908
File System Counters
FILE: Number of bytes read=658079
FILE: Number of bytes written=2552888
FILE: Number of large read operations=0
FILE: Number of read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=568
HDFS: Number of bytes read erasure-coded=0
HDFS: Number of bytes written=1397908
HDFS: Number of large read operations=0
HDFS: Number of read operations=13
HDFS: Number of write operations=2
S3: Number of bytes read=1994689
S3: Number of bytes written=0
S3: Number of large read operations=0
S3: Number of read operations=0
S3: Number of write operations=0
Job Counters
Data-local map tasks=4
Killed map tasks=1
Launched map tasks=4
Launched reduce tasks=1
Total megabyte-milliseconds taken by all map tasks=91127808
Total megabyte-milliseconds taken by all reduce tasks=17491968
Total time spent by all map tasks (ms)=29664
Total time spent by all maps in occupied slots (ms)=2847744
Total time spent by all reduce tasks (ms)=2847
Total time spent by all reduces in occupied slots (ms)=546624
Total vcore-milliseconds taken by all map tasks=29664
Total vcore-milliseconds taken by all reduce tasks=2847
Map-Reduce Framework
CPU time spent (ms)=23910
Combine input records=0
Combine output records=0
Failed Shuffles=0
GC time elapsed (ms)=834
Input split bytes=568
Map input records=100000
Map output bytes=1879173
Map output materialized bytes=683872
Map output records=100000
Merged Map outputs=4
Peak Map Physical memory (bytes)=712859648
Peak Map Virtual memory (bytes)=4446281728
Peak Reduce Physical memory (bytes)=230252544
Peak Reduce Virtual memory (bytes)=7088242688
Physical memory (bytes) snapshot=2708877312
Reduce input groups=943
Reduce input records=100000
Reduce output records=943
Reduce shuffle bytes=683872
Shuffled Maps =4
Spilled Records=200000
Total committed heap usage (bytes)=2690646016
Virtual memory (bytes) snapshot=24827822080
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
Terminating cluster: j-320TQKHQJ683U
Traceback (most recent call last):
File "MovieSimilarities.py", line 129, in <module>
MovieSimilarities.run()
File "/home/hostname/PycharmProjects/Taming-Big-Data-with-MapReduce-and-Hadoop/venv/lib/python3.8/site-packages/mrjob/job.py", line 616, in run
cls().execute()
File "/home/hostname/PycharmProjects/Taming-Big-Data-with-MapReduce-and-Hadoop/venv/lib/python3.8/site-packages/mrjob/job.py", line 687, in execute
self.run_job()
File "/home/hostname/PycharmProjects/Taming-Big-Data-with-MapReduce-and-Hadoop/venv/lib/python3.8/site-packages/mrjob/job.py", line 636, in run_job
runner.run()
File "/home/hostname/PycharmProjects/Taming-Big-Data-with-MapReduce-and-Hadoop/venv/lib/python3.8/site-packages/mrjob/runner.py", line 503, in run
self._run()
File "/home/hostname/PycharmProjects/Taming-Big-Data-with-MapReduce-and-Hadoop/venv/lib/python3.8/site-packages/mrjob/emr.py", line 705, in _run
self._finish_run()
File "/home/hostname/PycharmProjects/Taming-Big-Data-with-MapReduce-and-Hadoop/venv/lib/python3.8/site-packages/mrjob/emr.py", line 710, in _finish_run
self._wait_for_steps_to_complete()
File "/home/hostname/PycharmProjects/Taming-Big-Data-with-MapReduce-and-Hadoop/venv/lib/python3.8/site-packages/mrjob/emr.py", line 1570, in _wait_for_steps_to_complete
self._add_steps_to_cluster(
File "/home/hostname/PycharmProjects/Taming-Big-Data-with-MapReduce-and-Hadoop/venv/lib/python3.8/site-packages/mrjob/emr.py", line 1537, in _add_steps_to_cluster
step_ids = emr_client.add_job_flow_steps(**steps_kwargs)['StepIds']
File "/home/hostname/PycharmProjects/Taming-Big-Data-with-MapReduce-and-Hadoop/venv/lib/python3.8/site-packages/mrjob/retry.py", line 108, in call_and_maybe_retry
return f(*args, **kwargs)
File "/home/hostname/PycharmProjects/Taming-Big-Data-with-MapReduce-and-Hadoop/venv/lib/python3.8/site-packages/botocore/client.py", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/home/hostname/PycharmProjects/Taming-Big-Data-with-MapReduce-and-Hadoop/venv/lib/python3.8/site-packages/botocore/client.py", line 676, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the AddJobFlowSteps operation: A job flow that is shutting down, terminated, or finished may not be modified.
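From the traceback, mrjob appears to submit each step of the job to the cluster through the EMR AddJobFlowSteps API, and that call is what gets rejected. My log shows "Waiting for Step 1 of 3" and then "Terminating cluster" right before the traceback, so it looks like the cluster was already shutting down when mrjob tried to add the remaining steps. As far as I can tell, this is roughly the equivalent boto3 call mrjob is making (the cluster ID is the one from my log; the step definition is just a placeholder for illustration):
import boto3

emr = boto3.client('emr', region_name='us-west-2')

# AddJobFlowSteps is only accepted while the cluster is in a modifiable
# state (STARTING, BOOTSTRAPPING, RUNNING, or WAITING). Once the cluster
# is TERMINATING or TERMINATED, EMR rejects the call with the
# ValidationException shown above.
state = emr.describe_cluster(ClusterId='j-320TQKHQJ683U')['Cluster']['Status']['State']
print(state)

emr.add_job_flow_steps(
    JobFlowId='j-320TQKHQJ683U',
    Steps=[{
        'Name': 'placeholder step',  # hypothetical step, for illustration only
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {'Jar': 'command-runner.jar', 'Args': ['true']},
    }],
)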
Here’s the code:
from mrjob.job import MRJob
from mrjob.step import MRStep
from math import sqrt
from itertools import combinations
class MovieSimilarities(MRJob):
def __init__(self, args=None):
super().__init__(args)
self.movieNames = {}
def configure_args(self):
super(MovieSimilarities, self).configure_args()
self.add_file_arg('--items', help='Path to u.item')
def load_movie_names(self):
# Load database of movie names.
with open("u.item", encoding='ascii', errors='ignore') as f:
for line in f:
fields = line.split('|')
self.movieNames[int(fields[0])] = fields[1]
def steps(self):
return [
MRStep(mapper=self.mapper_parse_input,
reducer=self.reducer_ratings_by_user),
MRStep(mapper=self.mapper_create_item_pairs,
reducer=self.reducer_compute_similarity),
MRStep(mapper=self.mapper_sort_similarities,
mapper_init=self.load_movie_names,
reducer=self.reducer_output_similarities)]
def mapper_parse_input(self, key, line):
# Outputs userID => (movieID, rating)
        (userID, movieID, rating, timestamp) = line.split('\t')
yield userID, (movieID, float(rating))
def reducer_ratings_by_user(self, user_id, itemRatings):
# Group (item, rating) pairs by userID
ratings = []
for movieID, rating in itemRatings:
ratings.append((movieID, rating))
yield user_id, ratings
def mapper_create_item_pairs(self, user_id, itemRatings):
# Find every pair of movies each user has seen, and emit
# each pair with its associated ratings
# "combinations" finds every possible pair from the list of movies
# this user viewed.
for itemRating1, itemRating2 in combinations(itemRatings, 2):
movieID1 = itemRating1[0]
rating1 = itemRating1[1]
movieID2 = itemRating2[0]
rating2 = itemRating2[1]
# Produce both orders so sims are bi-directional
yield (movieID1, movieID2), (rating1, rating2)
yield (movieID2, movieID1), (rating2, rating1)
def cosine_similarity(self, ratingPairs):
# Computes the cosine similarity metric between two
# rating vectors.
numPairs = 0
sum_xx = sum_yy = sum_xy = 0
for ratingX, ratingY in ratingPairs:
sum_xx += ratingX * ratingX
sum_yy += ratingY * ratingY
sum_xy += ratingX * ratingY
numPairs += 1
numerator = sum_xy
denominator = sqrt(sum_xx) * sqrt(sum_yy)
score = 0
if (denominator):
score = (numerator / (float(denominator)))
return (score, numPairs)
def reducer_compute_similarity(self, moviePair, ratingPairs):
# Compute the similarity score between the ratings vectors
# for each movie pair viewed by multiple people
# Output movie pair => score, number of co-ratings
score, numPairs = self.cosine_similarity(ratingPairs)
# Enforce a minimum score and minimum number of co-ratings
# to ensure quality
if numPairs > 10 and score > 0.95:
yield moviePair, (score, numPairs)
def mapper_sort_similarities(self, moviePair, scores):
# Shuffle things around so the key is (movie1, score)
# so we have meaningfully sorted results.
score, n = scores
movie1, movie2 = moviePair
        yield (self.movieNames[int(movie1)], score), \
            (self.movieNames[int(movie2)], n)
def reducer_output_similarities(self, movieScore, similarN):
# Output the results.
# Movie => Similar Movie, score, number of co-ratings
movie1, score = movieScore
for movie2, n in similarN:
yield movie1, (movie2, score, n)
if __name__ == '__main__':
MovieSimilarities.run()
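To sanity-check the MapReduce logic itself, the script can also be run with mrjob’s default inline runner, with no AWS involved, by dropping -r emr from the command above:
python3 MovieSimilarities.py --items=ml-100k/u.item ml-100k/u.data > sims-local.txt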
Here’s the link to get the data: files.grouplens.org/datasets/movielens/ml-100k.zip
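For reference, u.data has one rating per line, tab-separated: userID, movieID, rating, timestamp. That’s the format mapper_parse_input expects; a quick check of the parse:
# A sample line from u.data (fields separated by tabs):
line = "196\t242\t3\t881250949"

# The same parse as mapper_parse_input
userID, movieID, rating, timestamp = line.split('\t')
assert (userID, movieID, rating) == ("196", "242", "3")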
I have exported my aws_access_key_id and aws_secret_access_key in my .bashrc
and restarted my shell.
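The credentials themselves seem fine, since the cluster does launch; here’s the quick boto3 sanity check I used to confirm they’re picked up from the environment (just a sketch, assuming boto3 is installed):
import os
import boto3

# mrjob and boto3 both read credentials from these environment variables
print('AWS_ACCESS_KEY_ID set:', 'AWS_ACCESS_KEY_ID' in os.environ)

# sts:GetCallerIdentity succeeds only if the credentials are valid
print(boto3.client('sts').get_caller_identity()['Arn'])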
I need help understanding what I’m doing wrong, and what this error means:
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the AddJobFlowSteps operation: A job flow that is shutting down, terminated, or finished may not be modified.
Answer
The botocore package is actually deprecated, and since that module relies on the botocore package, that module is now broken. Sorry for the inconvenience.