https://arxiv.org/pdf/1605.04812.pdf
The difference between off-policy and on-policy methods is that with off-policy methods you do not need to follow any specific policy: your agent could even behave randomly, and an off-policy method can still find the optimal policy. On-policy methods, by contrast, learn about the policy that is actually being followed. Q-Learning is off-policy, so it will find the optimal policy regardless of the policy used during exploration; however, this holds only if every state is visited often enough. A minimal sketch of this is given below.
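A rough illustration of that point, assuming a small tabular environment with a hypothetical interface `env.reset()` returning a state index and `env.step(s, a)` returning `(next_state, reward, done)`: the behavior policy is purely random, yet the Q-table is updated toward the greedy target, so the greedy policy read off the table can still approach the optimal one given enough visits to every state-action pair.

```python
import numpy as np

def q_learning_random_behavior(env, n_states, n_actions,
                               episodes=5000, alpha=0.1, gamma=0.99):
    """Off-policy sketch: actions are chosen uniformly at random, but the
    update bootstraps from the greedy (max) action in the next state."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()                               # hypothetical interface
        done = False
        while not done:
            a = np.random.randint(n_actions)          # random behavior policy
            s_next, r, done = env.step(s, a)          # hypothetical interface
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q.argmax(axis=1)  # greedy (target) policy extracted from Q
```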
Similar answer from the same post:
The reason that Q-learning is off-policy is that it updates its Q-values using the Q-value of the next state s′ and the **greedy action a′**. In other words, it estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy were followed, even though it is not actually following a greedy policy.
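Concretely, the Q-learning target takes the max over next-state actions, regardless of which action the behavior policy will actually take next (a sketch, with `alpha` and `gamma` as the usual step size and discount):

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Bootstrap from the greedy action in s', not the action actually taken
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
```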
The reason that SARSA is on-policy is that it updates its Q-values using the Q-value of the next state s′ and the **current policy's action a″**. It estimates the return for state-action pairs assuming the current policy continues to be followed.
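By contrast, the SARSA target plugs in the action a″ that the current (e.g. ε-greedy) policy actually selects in s′, so the values it learns reflect that policy's own exploration (same sketch style as above):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # Bootstrap from the action the current policy actually takes in s'
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])
```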