Regression forecasting and predicting – Practical Machine Learning Tutorial with Python p.5

In this video, make sure you define the X’s like so. I flipped the last two lines by mistake:

X = np.array(df.drop(['label'], axis=1))
X = preprocessing.scale(X)
X_lately = X[-forecast_out:]
X = X[:-forecast_out]

To forecast out, we need some data. We decided to forecast out 10% of the data, so we want to, or at least *can*, generate forecasts for each of the final 10% of the rows. When do we set that data aside? We could do it now, but consider that the data we're forecasting from is not scaled like the training data was. So then what? Do we just call preprocessing.scale() on the last 10%? No: the scale method scales relative to all of the data fed into it, so scaling the last 10% on its own would give different values than scaling it along with everything else. Ideally, you would scale the training, testing, AND forecast/predicting data all together. Is this always possible or reasonable? No. If you can do it, you should, however. In our case, right now, we can. Our data is small enough and the processing time is low enough, so we'll preprocess and scale all of the data at once.
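To see why scaling the forecast rows on their own is a problem, here is a small sketch. The single rising feature column is made up for illustration; only preprocessing.scale and the slicing by forecast_out come from the tutorial.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical feature column: 100 steadily rising values.
X = np.arange(100, dtype=float).reshape(-1, 1)
forecast_out = 10  # the last 10% of the rows

# Option A: scale everything together, then slice off the forecast rows.
X_all_scaled = preprocessing.scale(X)
X_lately_together = X_all_scaled[-forecast_out:]

# Option B: scale only the forecast rows, in isolation.
X_lately_alone = preprocessing.scale(X[-forecast_out:])

# The same raw rows end up with very different scaled values.
print(X_lately_together.ravel().round(2))
print(X_lately_alone.ravel().round(2))
```

In option A the last rows keep their position relative to the whole dataset (all well above the mean). In option B they are re-centered around zero, so a model trained on option A's scale would misread them.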

In many cases, you won't be able to do this. Imagine you were using gigabytes of data to train a classifier. It may take days to train your classifier, and you wouldn't want to repeat that every…single…time you wanted to make a prediction. Thus, you may need to either NOT scale anything, or scale the data separately. As usual, you will want to test both options and see which is best in your specific case.
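One common way to scale new data without rescaling and retraining on everything is to fit a scaler once on the training data and reuse it. This is not what the video does; it is a sketch using scikit-learn's StandardScaler, and the feature values below are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training features: 200 rows, 3 columns.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=5.0, size=(200, 3))

# Fit the scaler once; it remembers each column's mean and standard deviation.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# Later, a new sample is scaled with the stored statistics,
# without touching the training data again.
new_sample = np.array([[52.0, 47.5, 50.0]])
new_scaled = scaler.transform(new_sample)
print(new_scaled)
```

The fitted scaler can be saved alongside the trained model, so both are available at prediction time. Whether this beats not scaling at all is still something to test on your own data.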

With that in mind, let’s handle all of the rows from the definition of X onward.
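As a rough sketch, the rows from the definition of X onward might look like the following. The DataFrame here is random stand-in data rather than the Quandl stock data from the video, and train_test_split is imported from sklearn.model_selection, which is an assumption about your scikit-learn version (older releases used sklearn.cross_validation).

```python
import math
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in data with the same column names as the tutorial's DataFrame.
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    'Adj. Close': 100 + np.cumsum(rng.normal(0, 1, n)),
    'HL_PCT': rng.normal(1.0, 0.2, n),
    'PCT_change': rng.normal(0.0, 0.5, n),
    'Adj. Volume': rng.integers(1_000, 5_000, n).astype(float),
})

forecast_out = int(math.ceil(0.1 * len(df)))
# The label is the price forecast_out rows into the future.
df['label'] = df['Adj. Close'].shift(-forecast_out)

X = np.array(df.drop(columns=['label']))
X = preprocessing.scale(X)        # scale all rows together
X_lately = X[-forecast_out:]      # rows with no label yet: forecast from these
X = X[:-forecast_out]             # rows that do have a label

df.dropna(inplace=True)           # drop the rows whose label is NaN
y = np.array(df['label'])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = LinearRegression()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # R^2 on the held-out rows

forecast_set = clf.predict(X_lately)
print(forecast_set[:5])
```

Note that X_lately has to be taken before X is trimmed; taking it afterward would grab rows that already have labels.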

30 comments

  1. Roland Gemayel says:

    +sentdex Just another important observation. Correct me if I’m wrong but I
    think the script doesn’t answer the “problem” that you intended to answer.
    When you shift by “forecast_out” days, you are NOT predicting the price FOR
    THE NEXT number of “forecast_out” days, but instead you are predicting the
    price at T + number of forecast_out days.

    So basically, if forecast_out = 5, the code as it is says that if I want to
    predict the price tomorrow, then I should look at the features 5 days ago.

    The plot should have the “label” (not the “Adj. Close”) and the forecast.

  2. Azzi says:

    Why are the forecast dates in the past, while we’re predicting the future?
    I counted it, and my last forecast day is 12 days behind the present day.
    Your last forecast day is 18 days before uploading this video. That doesn’t
    really make sense to me?

  3. Heng Soon says:

    +sentdex I might be missing something, but is there a specific reason you
    do Y = np.array(['label']) twice?

  4. Weiyi Liao says:

    Could you please show me how to set the end_date of Quandl.get to get
    exactly the same data as you do? Because we run Quandl.get on different
    dates, the data we get could be slightly different, which leads to
    different results.
    I tried the end_date in Quandl.get, but it doesn't work.

  5. Lance Dacey says:

    Hello – I had a question about incorporating the Day of Month (1 – 31) and
    the Weekday (Mon-Sun) into machine learning and how I would shape the data.
    At a glance, it is apparent that values in my data are lower towards the
    end of the month and Mondays have the highest values in general. I would
    like to use this historical data and test for these relationships and
    predict future values.

    Since I am interested in future values based on the date, I assume that
    the date would be my label. For features I would have my value column,
    but maybe I would need 5 separate columns for the values per weekday? Or
    maybe even (5 weekdays * 31 days) of columns if I want to look at the
    weekday and the day of month? I hope that my description is clear.

  6. Gareth Griffiths says:

    Ok, this code has a bug.
    X = np.array(df.drop(['label'], 1)) does not do what you intend as you are
    keeping 'Adj. Close'
    If you do this instead:
    X = np.array(df[['Adj. Volume','HL_PCT','PCT_change']])
    You then have the features you are after (run both scripts and look at
    len(X))
    The score then drops to low 30%, no longer circa 96%
    Furthermore, when you chart the prediction, your indexing is wrong
    We’d all be millionaires if that google prediction was correct

  7. Romain Vincent says:

    When I compare the content of "forecast_set" to the actual stock prices
    (that is, y[-forecast_out:]), I notice an average 40$ difference. It seems
    a bit odd since we have a 95% accuracy, right?

  8. Gareth Griffiths says:

    Ok yes I understand. However, for a score of circa 96%, try doing
    clf.predict(X[0]). The label is 66.29, yet we get a prediction of 42.59.
    How is this large difference possible for such a high score?

  9. Dan Hunt says:

    For anyone in python 2.7, you can use:
    last_unix = time.mktime(last_date.timetuple())

  10. L Radhakrishna Rao says:

    I am using Python 3.5, still the timestamp is not working, and therefore, I
    used this method:
    last_unix =
    time.mktime(datetime.datetime.strptime(last_date,"%d%m%y").timetuple())

    But in the plot, I am not getting dates properly, instead, it is showing
    the labels of Price.

  11. Akisha Dilshani says:

    Thank you very much for the wonderful tutorial series.
    I have a problem with the equation when applying it to my data set. I'm
    building a system to predict water quality in a river by inputting data
    from several points. I have data about rainfall and data about
    contamination from point to point. Altogether I have 5 data columns
    including the rainfall. Please help me to solve this.

  12. Santhosh Dasari says:

    Okay. But you forgot to account for weekends, which are declared as
    holidays, and also other national holidays.

  13. Tina Davis says:

    y'all prob know this but if you're substituting files from your computer you
    gots to index the date: df = pd.read_csv(r'C:\Python\table(1).csv',
    index_col='Date', parse_dates=True), else trouble ensues.
    also changed: last_date = df.iloc[0].name and last_unix =
    time.mktime(last_date.timetuple())

  14. Sai Vinod krishna Bonthala says:

    hey, i am using python 2.7 and when i use last_unix=last_date.timestamp()
    it is giving me an error like
    AttributeError: 'Timestamp' object has no attribute 'timestamp'

  15. Chatzistamou Aimilios says:

    Hey, I have a quick question 🙂 I am a bit confused about what linear
    regression actually is.. Linear, line, .. Is it not supposed to be a line
    being the closest estimate? Then how is it possible that it estimates that
    many variations (as we see in the plot). It’s not a rough approx at all!

    Thanks in advance if anyone can clarify this 😉
    PS: You’re the best

  16. Raouf Gnda says:

    can I get help to solve this error! AttributeError: ‘Timestamp’ object has
    no attribute ‘timestamp’

  17. Minjun Kim says:

    Sentdex, sorry to say this but, it is not a prediction to the one month
    future. At the last step, when you plot them, you made a mistake. df[‘Adj.
    Close’] runs from the first row to the “-30” row of the original dataframe.
    It is 30 days shorter than the original dataframe because of
    df.dropna(inplace=True) right before you define ‘y’ in order to make the X
    len and y len the same. And the “forecast set” should run from the very
    next date of the last date of the original dataframe and 30 days from that
    date.

    What you plot is: shorter dataframe’s Adj. Close + df[Forecast_set]
    What you should have plotted is: original dataframe’s Adj. Close +
    df[Forecast_set]
    Because what you have predicted is the whole new next 30 days.

    Please check with it and give me some feedback on this. I appreciate it.

  18. Andreas Simons says:

    Hey! I don’t fully understand what “X_lately = X[-forecast_out:]
    X = X[:-forecast_out]” seems to do, I know it is a bit of a luxury problem
    since I don’t have any errors but I’d like to learn a bit more

  19. Kenan Sooklall says:

    clf.predict(X_lately) works fine; however the prediction doesn't start from
    the current data. There is a gap on the graph. Do you know the reason for
    that?

  20. Paul Kwok says:

    Can I use the CLF to predict more than a 30-day forecast? I think the
    example here does not forecast the price for the future. (It is a little
    bit confusing.)

  21. Owen Spottiswoode says:

    Hi Harrison, thanks for the fantastic series. Quick Q if I may: you’re
    making predictions in this vid using data that has been scaled, but what do
    you do if you’re using raw, live data to make a prediction? You can’t scale
    a single sample, and surely if the classifier is trained on scaled data
    you’ll get some funky predictions if you feed it unprocessed stuff. Is
    there a way to scale a single sample relative to the classifier
    (particularly if you’re loading from a pickle)? Thanks again!

  22. Scott Mandel says:

    Where does n_jobs=-1 come from in the line, clf =
    LinearRegression(n_jobs=-1)?
    Python is giving me the following error… TypeError: __init__() got an
    unexpected keyword argument 'n_jobs'.
    I am using Python 2.7 and all modules are up to date.

  23. Nonton Anime says:

    error AttributeError: 'Timestamp' object has no attribute 'timestamp'
    i changed to last_unix = time.mktime(last_date.timetuple()), error
    SyntaxError: invalid syntax
    i'm using python version 2.7.6

  24. J Bazi says:

    There is something seriously wrong with the logic of this algorithm I
    think. OR I am completely on the wrong track.

    The algorithm “forecasts” such a good indication of the stock price that it
    knows when the market dropped with the Brexit. Seems very fishy.

    I do really like your videos and I am learning a lot, but I do feel
    something is not right here.

  25. devesh aggrawal says:

    there is a discontinuity between Adj. Close and forecast if i plot them
    separately, but if i plug the values of forecast_set into the Adj. Close
    column then there is no discontinuity. Can you explain why and how to
    tackle it?

  26. devesh aggrawal says:

    And why did matplotlib plot the graph against the date without it being
    specified? You only labeled the axis Date but did not specify it.
    What about that?

  27. PN says:

    What exactly do we use the classifier for?
    In the clf.fit(), is the “fit” thing to find the best fitting line for the
    given dataset or something? If not what does it do?

  28. Jeff van Geete says:

    Is there any way you can post the final script you're using? I made the
    corrections noted both in the video and in the comments below, and am still
    getting an error on the cross_validation stage saying the rows aren't
    lining up…

  29. Jeff van Geete says:

    Sorry to ask another so quickly but do you have the py2.7 fix for the unix
    time stamp section? I am having difficulty returning a unix result when try
    to use the workaround

    last_date = datetime.datetime(df.iloc[-1].name)
    last_unix = last_date.timedelta(seconds = one_day)

  30. Lingobol :) says:

    Dear Sir,

    Thanks for the wonderful tutorial. I have a doubt.

    My understanding of what is being done:

    Features are of a date suppose K and label is of the date K+30

    So while predicting when we are using features of date say 1 October,
    aren’t we predicting label(price) of 31st October ??

    So we are actually not predicting for next 30 days but 30 days after the
    next 30 days.

    Please clarify.
